Title:
Uncovering Market Trends and Key Drivers of Customer Satisfaction: A Data Analysis of Sephora Products and Reviews
Team Members:
This project focuses on analyzing the Sephora Products and Skincare Reviews Dataset, which contains detailed information on over 8,000 beauty products and 1 million customer reviews. The dataset provides key insights into the cosmetic industry through product attributes, prices, ingredients, and customer feedback.
Our primary objective is to identify factors that drive higher product ratings and understand how ingredients, prices, and product types influence customer satisfaction. Additionally, we aim to explore why certain products underperform by analyzing user reviews. Beyond ratings, we will investigate trends in consumer sentiment over time and assess the impact of movements like "clean beauty" and seasonal variations in feedback.
Using data manipulation, sentiment analysis, and visualization, we intend to uncover patterns in customer behavior, empowering brands to align their strategies with customer needs and industry trends.
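To make the planned sentiment-analysis step concrete, here is a toy lexicon-based scorer. The word lists below are invented purely for illustration; in the actual analysis we would rely on an established tool (for example, NLTK's VADER or a pretrained model) rather than this sketch.

```python
# Illustrative only: a toy lexicon-based sentiment scorer.
# The POSITIVE/NEGATIVE word lists are made-up examples, not a real lexicon.
POSITIVE = {"love", "amazing", "great", "gentle", "hydrating"}
NEGATIVE = {"breakout", "irritating", "sticky", "greasy", "waste"}

def toy_sentiment(text: str) -> float:
    """Return a score in [-1, 1]: (pos - neg) / total sentiment words found."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(toy_sentiment("I love this gentle, hydrating cream"))  # 1.0
print(toy_sentiment("Sticky and irritating, total waste"))   # -1.0
```

A real scorer also handles negation, intensifiers, and emoji, which is why we would use an existing library in practice.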
The cosmetic and skincare industry is evolving rapidly, with consumers becoming more focused on tailored solutions and sustainable products. Customer reviews are now essential in shaping brand reputations and guiding purchase decisions. This project is motivated by a desire to understand what drives customer satisfaction, allowing businesses to improve product offerings based on consumer needs.
Specifically, we aim to address the following four real-world questions:
What factors drive higher product ratings? Why do certain products underperform?
How does customer sentiment change over time?
Are certain ingredients linked to positive or negative customer experiences?
How do certain ingredients relate to effectiveness for customers with different skin types?
Through this analysis, we hope to generate actionable insights that help brands meet customer expectations, avoid problematic ingredients, and enhance customer experiences.
Our project utilizes two primary datasets, which complement each other by linking product attributes with customer feedback, providing comprehensive information for our analysis.
This data was collected in March 2023 and provides detailed information on a variety of beauty and skincare products available on Sephora's online store. It includes both product details and customer reviews, offering a comprehensive view of customer feedback and product features. The dataset is publicly available on Kaggle, a popular data platform, making it suitable for exploratory analysis and machine learning tasks.
URL:
Description:
Product Information Dataset (product_info.csv)
Customer Reviews Dataset
reviews_0-250.csv, reviews_250-500.csv, reviews_500-750.csv, reviews_750-1250.csv, reviews_1250-end.csv

The product information dataset provides insight into product attributes such as ingredients and pricing, while the customer reviews dataset offers feedback on how these products perform from the customer’s perspective. Together, these datasets offer a comprehensive view of the relationship between product attributes and consumer experiences, enabling us to uncover actionable insights for brands in the cosmetic industry.
It is worth noting that the customer reviews dataset covers reviews on only 2,000 skincare products, while the product information dataset includes details on 8,000+ products across various categories.
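Once both datasets are loaded, this coverage gap can be checked directly with pandas. The sketch below uses small toy frames (`products_toy`, `reviews_toy`) as stand-ins, since the real CSVs are loaded later in this notebook:

```python
import pandas as pd

# Toy stand-ins for the real product and review frames (loaded from CSV below).
products_toy = pd.DataFrame({"product_id": ["P1", "P2", "P3", "P4"]})
reviews_toy = pd.DataFrame({"product_id": ["P1", "P1", "P3"], "rating": [5, 4, 3]})

# How many distinct products appear in the reviews, and what fraction of
# the catalog they cover.
reviewed = reviews_toy["product_id"].nunique()
coverage = products_toy["product_id"].isin(reviews_toy["product_id"]).mean()
print(f"{reviewed} distinct products reviewed; {coverage:.0%} of catalog covered")
```

Running the same two lines on `review_df` and `product_df` quantifies how much of the 8,000+ product catalog the review data actually covers.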
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
First, we load the data from the existing CSV files and get an idea of their shape and the fields they include.
product_df = pd.read_csv("Sephora/product_info.csv")
print("product info dataframe shape: ", product_df.shape)
product_df.head()
product info dataframe shape: (8494, 27)
| product_id | product_name | brand_id | brand_name | loves_count | rating | reviews | size | variation_type | variation_value | variation_desc | ingredients | price_usd | value_price_usd | sale_price_usd | limited_edition | new | online_only | out_of_stock | sephora_exclusive | highlights | primary_category | secondary_category | tertiary_category | child_count | child_max_price | child_min_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P473671 | Fragrance Discovery Set | 6342 | 19-69 | 6320 | 3.6364 | 11.0 | NaN | NaN | NaN | NaN | ['Capri Eau de Parfum:', 'Alcohol Denat. (SD A... | 35.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... | Fragrance | Value & Gift Sets | Perfume Gift Sets | 0 | NaN | NaN |
| 1 | P473668 | La Habana Eau de Parfum | 6342 | 19-69 | 3827 | 4.1538 | 13.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 85.0 | 30.0 |
| 2 | P473662 | Rainbow Bar Eau de Parfum | 6342 | 19-69 | 3253 | 4.2500 | 16.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 |
| 3 | P473660 | Kasbah Eau de Parfum | 6342 | 19-69 | 3018 | 4.4762 | 21.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 |
| 4 | P473658 | Purple Haze Eau de Parfum | 6342 | 19-69 | 2691 | 3.2308 | 13.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 |
To get an overview of the dataset both structurally and statistically, we use .info() and .describe() respectively; the output is displayed as follows.
print("Basic Information:")
product_df.info()
Basic Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8494 entries, 0 to 8493
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          8494 non-null   object 
 1   product_name        8494 non-null   object 
 2   brand_id            8494 non-null   int64  
 3   brand_name          8494 non-null   object 
 4   loves_count         8494 non-null   int64  
 5   rating              8216 non-null   float64
 6   reviews             8216 non-null   float64
 7   size                6863 non-null   object 
 8   variation_type      7050 non-null   object 
 9   variation_value     6896 non-null   object 
 10  variation_desc      1250 non-null   object 
 11  ingredients         7549 non-null   object 
 12  price_usd           8494 non-null   float64
 13  value_price_usd     451 non-null    float64
 14  sale_price_usd      270 non-null    float64
 15  limited_edition     8494 non-null   int64  
 16  new                 8494 non-null   int64  
 17  online_only         8494 non-null   int64  
 18  out_of_stock        8494 non-null   int64  
 19  sephora_exclusive   8494 non-null   int64  
 20  highlights          6287 non-null   object 
 21  primary_category    8494 non-null   object 
 22  secondary_category  8486 non-null   object 
 23  tertiary_category   7504 non-null   object 
 24  child_count         8494 non-null   int64  
 25  child_max_price     2754 non-null   float64
 26  child_min_price     2754 non-null   float64
dtypes: float64(7), int64(8), object(12)
memory usage: 1.7+ MB
print("\nSummary Statistics (Numerical):")
product_df.describe()
Summary Statistics (Numerical):
| brand_id | loves_count | rating | reviews | price_usd | value_price_usd | sale_price_usd | limited_edition | new | online_only | out_of_stock | sephora_exclusive | child_count | child_max_price | child_min_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8494.000000 | 8.494000e+03 | 8216.000000 | 8216.000000 | 8494.000000 | 451.000000 | 270.000000 | 8494.000000 | 8494.000000 | 8494.000000 | 8494.000000 | 8494.000000 | 8494.000000 | 2754.000000 | 2754.000000 |
| mean | 5422.440546 | 2.917957e+04 | 4.194513 | 448.545521 | 51.655595 | 91.168537 | 20.207889 | 0.070285 | 0.071698 | 0.219096 | 0.073699 | 0.279374 | 1.631622 | 53.792023 | 39.665802 |
| std | 1709.595957 | 6.609212e+04 | 0.516694 | 1101.982529 | 53.669234 | 79.195631 | 24.327352 | 0.255642 | 0.258002 | 0.413658 | 0.261296 | 0.448718 | 5.379470 | 58.765894 | 38.685720 |
| min | 1063.000000 | 0.000000e+00 | 1.000000 | 1.000000 | 3.000000 | 0.000000 | 1.750000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 3.000000 |
| 25% | 5333.000000 | 3.758000e+03 | 3.981725 | 26.000000 | 25.000000 | 45.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 22.000000 | 19.000000 |
| 50% | 6157.500000 | 9.880000e+03 | 4.289350 | 122.000000 | 35.000000 | 67.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 32.000000 | 28.000000 |
| 75% | 6328.000000 | 2.684125e+04 | 4.530525 | 418.000000 | 58.000000 | 108.500000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 59.000000 | 42.000000 |
| max | 8020.000000 | 1.401068e+06 | 5.000000 | 21281.000000 | 1900.000000 | 617.000000 | 320.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 105.000000 | 570.000000 | 400.000000 |
We load the customer review data into dataframes and concatenate them into one complete dataframe. After that, we apply similar analysis methods to get an intuitive overview. The results are displayed as follows.
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
review_df_1 = pd.read_csv("Sephora/reviews_0-250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_2 = pd.read_csv("Sephora/reviews_250-500.csv",index_col = 0, dtype={'author_id':'str'})
review_df_3 = pd.read_csv("Sephora/reviews_500-750.csv",index_col = 0, dtype={'author_id':'str'})
review_df_4 = pd.read_csv("Sephora/reviews_750-1250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_5 = pd.read_csv("Sephora/reviews_1250-end.csv",index_col = 0, dtype={'author_id':'str'})
# Merge review_df_1 through review_df_5
review_df = pd.concat([review_df_1, review_df_2, review_df_3, review_df_4, review_df_5],axis=0)
print("review_df shape: ",review_df.shape)
review_df shape: (1094411, 18)
review_df.sample(5)
| author_id | rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | submission_time | review_text | review_title | skin_tone | eye_color | skin_type | hair_color | product_id | product_name | brand_name | price_usd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 318520 | 2506218805 | 5 | 1.0 | NaN | 0 | 0 | 0 | 2020-01-22 | This product really surprised me!! Being on Ac... | So helpful with Accutane!! | light | blue | combination | blonde | P442840 | Barrier+ Triple Lipid-Peptide Face Cream | Skinfix | 54.0 |
| 47147 | 9153021874 | 5 | 1.0 | 1.0 | 2 | 0 | 2 | 2020-07-07 | This is unlike any micellar water I’ve tried b... | Texturally amazing | fair | hazel | combination | brown | P462347 | E-Rase Milki Micellar Water Mini | Drunk Elephant | 14.0 |
| 184012 | 1476710027 | 5 | 1.0 | NaN | 0 | 0 | 0 | 2020-08-10 | I’ve been using this product for a couple week... | So Far, So Good | fair | brown | normal | auburn | P460779 | Skinlongevity Long Life Herb Anti-Aging Face S... | bareMinerals | 65.0 |
| 167906 | 1419803262 | 3 | 1.0 | 1.0 | 2 | 0 | 2 | 2020-08-29 | I tried this out for a whole week to see how t... | Good night moisturizer | medium | brown | oily | brown | P461213 | Honey Balm Niacinamide Moisturizer | Wishful | 43.0 |
| 179021 | 2489687625 | 5 | 1.0 | 1.0 | 2 | 0 | 2 | 2022-07-01 | Love this product for summer and lighter makeu... | NaN | tan | brown | combination | brown | P500245 | C Beyond Triple Serum SPF 40 Mineral Sunscreen... | ILIA | 64.0 |
print("Basic Information:")
review_df.info()
Basic Information:
<class 'pandas.core.frame.DataFrame'>
Index: 1094411 entries, 0 to 49976
Data columns (total 18 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   author_id                 1094411 non-null  object 
 1   rating                    1094411 non-null  int64  
 2   is_recommended            926423 non-null   float64
 3   helpfulness               532819 non-null   float64
 4   total_feedback_count      1094411 non-null  int64  
 5   total_neg_feedback_count  1094411 non-null  int64  
 6   total_pos_feedback_count  1094411 non-null  int64  
 7   submission_time           1094411 non-null  object 
 8   review_text               1092967 non-null  object 
 9   review_title              783757 non-null   object 
 10  skin_tone                 923872 non-null   object 
 11  eye_color                 884783 non-null   object 
 12  skin_type                 982854 non-null   object 
 13  hair_color                867643 non-null   object 
 14  product_id                1094411 non-null  object 
 15  product_name              1094411 non-null  object 
 16  brand_name                1094411 non-null  object 
 17  price_usd                 1094411 non-null  float64
dtypes: float64(3), int64(4), object(11)
memory usage: 158.6+ MB
print("\nSummary Statistics (Numerical):")
review_df.describe()
Summary Statistics (Numerical):
| rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | price_usd | |
|---|---|---|---|---|---|---|---|
| count | 1.094411e+06 | 926423.000000 | 532819.000000 | 1.094411e+06 | 1.094411e+06 | 1.094411e+06 | 1.094411e+06 |
| mean | 4.299158e+00 | 0.839962 | 0.767782 | 4.177126e+00 | 8.948695e-01 | 3.282257e+00 | 4.900838e+01 |
| std | 1.149444e+00 | 0.366642 | 0.317164 | 2.271524e+01 | 5.288943e+00 | 1.967482e+01 | 4.004338e+01 |
| min | 1.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 |
| 25% | 4.000000e+00 | 1.000000 | 0.652174 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.500000e+01 |
| 50% | 5.000000e+00 | 1.000000 | 0.928571 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.900000e+01 |
| 75% | 5.000000e+00 | 1.000000 | 1.000000 | 3.000000e+00 | 1.000000e+00 | 3.000000e+00 | 6.200000e+01 |
| max | 5.000000e+00 | 1.000000 | 1.000000 | 5.464000e+03 | 1.159000e+03 | 5.050000e+03 | 1.900000e+03 |
Note: The missing value part will be introduced in section Data Cleaning.
After importing, merging, and describing the data, we go a step further into data manipulation, processing the data into a tidier, easier-to-use format. Alongside this, we conduct data visualization to display the various data distributions in a straightforward way, giving us a clearer sense of the data. Aspects include customer ratings, item prices, sizes, etc.
To see the overall reputation of the products, we visualize the distribution of customer ratings across all products. As shown below, a large percentage of customers give a rating between 4.5 and 5.0 out of 5.0, indicating that they are satisfied with the products.
# Customer Rating Distribution
plt.figure(figsize=(8, 5))
review_df['rating'].hist(bins=8, color='lightcoral', edgecolor='black')
plt.title('Customer Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
To see the price distribution of all products, we create the visualization below. To avoid distortion from outliers, we only visualize products priced at or below 350 USD.
# Price Distribution
# To avoid outliers, we will only visualize products with price <= 350 USD
plt.figure(figsize=(8, 5))
product_df[product_df['price_usd'] <= 350]['price_usd'].hist(bins=30, color='gold', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price (USD)')
plt.ylabel('Frequency')
plt.show()
To analyze product sizes, we extract numerical values from the size column and perform some datatype conversion to get the size in both oz and mL.
product_df['size']
0 NaN
1 3.4 oz/ 100 mL
2 3.4 oz/ 100 mL
3 3.4 oz/ 100 mL
4 3.4 oz/ 100 mL
...
8489 NaN
8490 2 oz / 60 mL
8491 NaN
8492 NaN
8493 .11 oz / 3.2 mL
Name: size, Length: 8494, dtype: object
product_df['size_oz'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*oz', expand=False).astype(float)
print(product_df[['size', 'size_oz']].sample(10))
                size  size_oz
1294             NaN      NaN
4859    6 oz/ 177 mL     6.00
6457             NaN      NaN
414    1.7 oz/ 50 mL     1.70
5919             NaN      NaN
56      16 oz/ 453 g    16.00
4917  0.12 oz/ 3.4 g     0.12
7706             NaN      NaN
350    1.7 oz/ 50 mL     1.70
2747  6.5 oz/ 192 mL     6.50
product_df['size_ml'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*mL', expand=False)
product_df['size_ml'] = product_df['size_ml'].astype(float)
print(product_df[['size', 'size_ml']].sample(10))
                size  size_ml
5067   1.7 oz/ 50 mL     50.0
8248     2 oz/ 60 mL     60.0
6149             NaN      NaN
538              NaN      NaN
6929   1.7 oz/ 50 mL     50.0
6306  6.8 oz/ 200 mL    200.0
4927             NaN      NaN
2355   7.0 oz/ 198 g      NaN
7556  0.33 oz/ 11 mL     11.0
1114   1.6 oz/ 50 mL     50.0
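Some products list a size only in ounces (or only in grams), leaving size_ml as NaN even when a volume could be inferred. For liquid products, one could backfill from the ounce figure using 1 fl oz ≈ 29.5735 mL. This is a sketch on a toy frame, and it assumes "oz" denotes fluid ounces, which is false for products measured by weight (e.g. "16 oz/ 453 g"), so such rows would need to be excluded in practice:

```python
import pandas as pd

OZ_TO_ML = 29.5735  # 1 US fluid ounce in millilitres

# Toy stand-in for product_df; the second row parsed an oz figure but no mL.
df = pd.DataFrame({
    "size_oz": [1.7, 6.0, None],
    "size_ml": [50.0, None, None],
})
# Backfill size_ml from size_oz where only the ounce figure was available.
df["size_ml"] = df["size_ml"].fillna(df["size_oz"] * OZ_TO_ML)
print(df["size_ml"].tolist())  # third entry stays NaN (no size at all)
```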
To get an idea of the per-unit price of products, we calculate price per oz and price per mL and visualize the distribution. The results are shown below.
product_df['price_per_oz'] = product_df['price_usd'] / product_df['size_oz']
product_df['price_per_ml'] = product_df['price_usd'] / product_df['size_ml']
product_df[['price_usd', 'size_oz', 'price_per_oz', 'size_ml', 'price_per_ml']].sample(10)
| price_usd | size_oz | price_per_oz | size_ml | price_per_ml | |
|---|---|---|---|---|---|
| 2114 | 67.0 | 1.70 | 39.411765 | 50.0 | 1.34 |
| 5667 | 22.0 | 1.70 | 12.941176 | 50.0 | 0.44 |
| 1264 | 38.0 | NaN | NaN | NaN | NaN |
| 5186 | 71.0 | 1.70 | 41.764706 | 50.0 | 1.42 |
| 1368 | 29.0 | 0.34 | 85.294118 | 10.0 | 2.90 |
| 8427 | 20.0 | NaN | NaN | NaN | NaN |
| 2079 | 150.0 | 1.69 | 88.757396 | 50.0 | 3.00 |
| 3419 | 27.0 | 0.42 | 64.285714 | 12.5 | 2.16 |
| 5513 | 29.0 | NaN | NaN | NaN | NaN |
| 7427 | 22.0 | NaN | NaN | NaN | NaN |
# Price per Ounce Distribution
# To avoid outliers, we will only visualize products with price per oz <= 500 USD
plt.figure(figsize=(8, 5))
product_df[product_df['price_per_oz'] <= 500]['price_per_oz'].hist(bins=30, color='gold', edgecolor='black')
plt.title('Price per Ounce Distribution')
plt.xlabel('Price per oz (USD)')
plt.ylabel('Frequency')
plt.show()
To see how many items belong to each category, we count products by both primary and secondary category.
product_df['primary_category'].value_counts()
primary_category
Skincare           2420
Makeup             2369
Hair               1464
Fragrance          1432
Bath & Body         405
Mini Size           288
Men                  60
Tools & Brushes      52
Gifts                 4
Name: count, dtype: int64
print(product_df['secondary_category'].shape)
product_df['secondary_category'].value_counts()
(8494,)
secondary_category
Women                        875
Hair Styling & Treatments    757
Eye                          711
Face                         659
Moisturizers                 551
Value & Gift Sets            498
Treatments                   466
Shampoo & Conditioner        431
Lip                          411
Cleansers                    361
Candles & Home Scents        263
Brushes & Applicators        246
Body Moisturizers            220
Mini Size                    187
Eye Care                     186
Masks                        166
Cheek                        165
Tools                        153
Makeup                       137
Men                          135
Sunscreen                    109
Skincare                      98
Bath & Shower                 84
High Tech Tools               80
Wellness                      79
Body Care                     69
Self Tanners                  64
Lip Balms & Treatments        61
Hair                          59
Nail                          52
Accessories                   45
Beauty Tools                  23
Makeup Palettes               20
Fragrance                     15
Shaving                       15
Hair Tools                    11
Bath & Body                    7
Shop by Concern                5
Beauty Accessories             5
Other Needs                    5
Beauty Supplements             2
Name: count, dtype: int64
We also visualize the distribution of the loves count, a measure of product popularity.
# Loves Count Distribution
# To avoid outliers, we will only visualize products with loves_count < 400,000
plt.figure(figsize=(8, 5))
product_df[product_df['loves_count'] < 400000]['loves_count'].hist(bins=30, color='gold', edgecolor='black')
plt.title('Loves Count Distribution')
plt.xlabel('Loves Count')
plt.ylabel('Frequency')
plt.show()
We also calculate and visualize the positive and negative feedback percentages of customer reviews, as another indicator of satisfaction.
review_df['pos_feedback_percentage'] = review_df['total_pos_feedback_count']/review_df['total_feedback_count']
review_df['neg_feedback_percentage'] = review_df['total_neg_feedback_count']/review_df['total_feedback_count']
review_df[['total_pos_feedback_count', 'total_feedback_count', 'pos_feedback_percentage', 'total_neg_feedback_count', 'neg_feedback_percentage']].sample(5)
| total_pos_feedback_count | total_feedback_count | pos_feedback_percentage | total_neg_feedback_count | neg_feedback_percentage | |
|---|---|---|---|---|---|
| 452991 | 0 | 0 | NaN | 0 | NaN |
| 11054 | 0 | 1 | 0.000000 | 1 | 1.000000 |
| 115964 | 1 | 3 | 0.333333 | 2 | 0.666667 |
| 89870 | 0 | 6 | 0.000000 | 6 | 1.000000 |
| 117871 | 0 | 0 | NaN | 0 | NaN |
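As the sample shows, reviews with a total_feedback_count of 0 produce NaN percentages (0/0). Whether to leave these as NaN or treat "no feedback" as 0% is a judgment call; a sketch with np.where on a toy frame (same column names as review_df) makes the choice explicit:

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same column names as review_df.
df = pd.DataFrame({"total_pos_feedback_count": [0, 1, 3],
                   "total_feedback_count": [0, 3, 3]})

# Plain division already yields NaN for 0/0 rows; np.where documents that
# zero-feedback rows are deliberately left NaN (swap np.nan for 0.0 to
# instead count "no feedback" as 0% positive).
df["pos_feedback_percentage"] = np.where(
    df["total_feedback_count"] > 0,
    df["total_pos_feedback_count"] / df["total_feedback_count"],
    np.nan,
)
print(df["pos_feedback_percentage"].tolist())
```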
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Feedback Percentage Distributions')
axes[0].hist(review_df['pos_feedback_percentage'], bins=30, color='green', edgecolor='black')
axes[0].set_title('Positive Feedback Percentage')
axes[0].set_xlabel('Percentage')
axes[0].set_ylabel('Number of Products')
axes[1].hist(review_df['neg_feedback_percentage'], bins=30, color='red', edgecolor='black')
axes[1].set_title('Negative Feedback Percentage')
axes[1].set_xlabel('Percentage')
axes[1].set_ylabel('Number of Products')
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
To examine the ingredients contained in the products, we process the strings in the ingredients column to extract individual ingredients, then draw some inferences with the help of visualization tools.
product_df['ingredients'] = product_df['ingredients'].str.strip("[]").str.replace("'", "").str.split(", ")
all_ingredients = product_df['ingredients'].explode().dropna().tolist()
ingredient_counts = Counter(all_ingredients)
ingredient_df = pd.DataFrame(ingredient_counts.items(), columns=['Ingredient', 'Count']).sort_values(by='Count', ascending=False)
ingredient_df.head(5)
| Ingredient | Count | |
|---|---|---|
| 191 | Glycerin | 4171 |
| 212 | Phenoxyethanol | 3862 |
| 211 | Caprylyl Glycol | 2747 |
| 321 | Tocopherol | 2681 |
| 20 | Limonene | 2537 |
top20_ingredients = [ingredient for ingredient, _ in ingredient_counts.most_common(20)]
# Add column in product_df for each of the top 20 ingredients for further analysis
for ingredient in top20_ingredients:
    column_name = f'contains_{ingredient}'
    product_df[column_name] = product_df['ingredients'].apply(lambda x: ingredient in x if isinstance(x, list) else False)
columns_to_display = ['product_id'] + [f'contains_{ingredient}' for ingredient in top20_ingredients]
product_df[columns_to_display].head()
| product_id | contains_Glycerin | contains_Phenoxyethanol | contains_Caprylyl Glycol | contains_Tocopherol | contains_Limonene | contains_Ethylhexylglycerin | contains_Dimethicone | contains_Linalool | contains_Silica | contains_Butylene Glycol | contains_Citric Acid | contains_Mica | contains_Potassium Sorbate | contains_Water | contains_Caprylic/Capric Triglyceride | contains_Sodium Benzoate | contains_Citronellol | contains_Tocopheryl Acetate | contains_Propanediol | contains_Geraniol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P473671 | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True |
| 1 | P473668 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | P473662 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | P473660 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True |
| 4 | P473658 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True |
fig, ax = plt.subplots(figsize=(12, 6))
ingredient_df.head(20).plot(kind='barh', x='Ingredient', y='Count', legend=False, color='skyblue', ax=ax)
ax.set_title('Top 20 Most Common Ingredients in Products')
ax.set_xlabel('Count')
ax.set_ylabel('Ingredient')
ax.invert_yaxis()
plt.show()
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(ingredient_counts)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Ingredients')
plt.show()
After all of the work and analysis above, we merge the two datasets together for more convenient future analysis. The process is shown below.
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
# Merge product_df and review_df
cols_to_use = product_df.columns.difference(review_df.columns)
cols_to_use = list(cols_to_use)
cols_to_use.append('product_id')
Sephora_df = pd.merge(review_df, product_df[cols_to_use], how='outer', on='product_id')
print("Sephora Shape: ", Sephora_df.shape)
Sephora Shape: (1100554, 66)
Sephora_df.sample(5)
| author_id | rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | submission_time | review_text | review_title | skin_tone | eye_color | skin_type | hair_color | product_id | product_name | brand_name | price_usd | pos_feedback_percentage | neg_feedback_percentage | brand_id | child_count | child_max_price | child_min_price | contains_Butylene Glycol | contains_Caprylic/Capric Triglyceride | contains_Caprylyl Glycol | contains_Citric Acid | contains_Citronellol | contains_Dimethicone | contains_Ethylhexylglycerin | contains_Geraniol | contains_Glycerin | contains_Limonene | contains_Linalool | contains_Mica | contains_Phenoxyethanol | contains_Potassium Sorbate | contains_Propanediol | contains_Silica | contains_Sodium Benzoate | contains_Tocopherol | contains_Tocopheryl Acetate | contains_Water | highlights | ingredients | limited_edition | loves_count | new | online_only | out_of_stock | price_per_ml | price_per_oz | primary_category | reviews | sale_price_usd | secondary_category | sephora_exclusive | size | size_ml | size_oz | tertiary_category | value_price_usd | variation_desc | variation_type | variation_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 407789 | 11958462128 | 5.0 | 1.0 | 1.0 | 7.0 | 0.0 | 7.0 | 2020-01-13 | I LOVE this product! After seeing reviews for ... | Little miracle bottle is what it is.... | light | hazel | combination | red | P427416 | Caffeine 5% + EGCG Depuffing Eye Serum | The Ordinary | 8.90 | 1.0 | 0.0 | 6234 | 0 | NaN | NaN | False | False | False | False | False | False | True | False | True | False | False | False | True | False | True | False | False | False | False | False | ['Vegan', 'Community Favorite', 'Good for: Dar... | [Aqua (Water), Caffeine, Maltodextrin, Glyceri... | 0 | 281928 | 0 | 0 | 0 | 0.296667 | 8.900000 | Skincare | 2118.0 | NaN | Treatments | 0 | 1 oz/ 30 mL | 30.0 | 1.0 | Face Serums | NaN | NaN | Size | 1 oz/ 30 mL |
| 932672 | 2077003084 | 5.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 2022-07-02 | The best! It is cheap and works amazing. I lit... | NaN | lightMedium | NaN | dry | NaN | P478030 | Mini Oat Cleansing Balm | The INKEY List | 5.99 | NaN | NaN | 6285 | 0 | NaN | NaN | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | True | False | False | ['allure 2020 Best of Beauty Award Winner', 'B... | [Prunus Amygdalus Dulcis (Sweet Almond) Oil, C... | 0 | 40474 | 0 | 0 | 0 | 0.119800 | 3.523529 | Skincare | 2968.0 | NaN | Cleansers | 1 | 1.7 oz/ 50 mL | 50.0 | 1.7 | Face Wash & Cleansers | NaN | NaN | Size | 1.7 oz/ 50 mL |
| 69763 | 1325856411 | 5.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 2019-11-12 | This is the best moisturizer I’ve ever used!!!... | AMAZING!!!!! | lightMedium | brown | combination | black | P248407 | Ultra Repair Cream Intense Hydration | First Aid Beauty | 38.00 | NaN | NaN | 5972 | 3 | 48.0 | 18.0 | False | True | True | False | False | True | False | False | True | False | False | False | True | False | False | False | False | False | False | True | ['Best for Dry Skin', 'Community Favorite', 'C... | [Colloidal Oatmeal 0.50%, Water, Stearic Acid,... | 0 | 300432 | 0 | 0 | 0 | NaN | 6.333333 | Skincare | 7539.0 | NaN | Moisturizers | 0 | 6 oz/ 170 g | NaN | 6.0 | Moisturizers | NaN | NaN | Size | 6 oz/ 170 g |
| 893051 | 31136105305 | 5.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 2021-07-08 | I received a sample of this and I am very impr... | NaN | fair | NaN | normal | NaN | P473820 | Advanced Night Repair Eye Concentrate Matrix S... | Estée Lauder | 82.00 | NaN | NaN | 6089 | 0 | NaN | NaN | True | False | False | False | False | True | True | False | True | False | False | False | True | True | False | False | False | False | True | False | ['Good for: Loss of firmness', 'Good for: Dark... | [Water/Aqua/Eau, Dimethicone, Bifida Ferment L... | 0 | 5480 | 0 | 1 | 0 | 5.466667 | 164.000000 | Skincare | 1096.0 | NaN | Treatments | 0 | 0.5 oz/ 15 mL | 15.0 | 0.5 | Face Serums | NaN | NaN | NaN | NaN |
| 306974 | 8487209118 | 5.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 2018-05-08 | I love all of the products from Sunday Riley. ... | Sunday Riley is the best in the business. | fair | green | normal | brown | P418629 | C.E.O. Vitamin C Brightening Rich Hydration Mo... | Sunday Riley | 22.00 | NaN | NaN | 6073 | 1 | 65.0 | 65.0 | True | True | True | True | False | False | False | False | True | True | True | False | True | False | False | False | False | True | False | True | ['Good for: Dullness/Uneven Texture', 'Vitamin... | [Water, Squalane, Dicaprylyl Carbonate, Tetrah... | 0 | 5905 | 0 | 0 | 0 | NaN | 44.000000 | Skincare | 1101.0 | NaN | Moisturizers | 0 | 0.5 oz/ 15 g | NaN | 0.5 | Moisturizers | NaN | NaN | Size | 0.5 oz/ 15 g |
Sephora_df.describe()
| rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | price_usd | pos_feedback_percentage | neg_feedback_percentage | brand_id | child_count | child_max_price | child_min_price | limited_edition | loves_count | new | online_only | out_of_stock | price_per_ml | price_per_oz | reviews | sale_price_usd | sephora_exclusive | size_ml | size_oz | value_price_usd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.094411e+06 | 926423.000000 | 532819.000000 | 1.094411e+06 | 1.094411e+06 | 1.094411e+06 | 1.094411e+06 | 532819.000000 | 532819.000000 | 1.100554e+06 | 1.100554e+06 | 455758.000000 | 455758.000000 | 1.100554e+06 | 1.100554e+06 | 1.100554e+06 | 1.100554e+06 | 1.100554e+06 | 854407.000000 | 992297.000000 | 1.100276e+06 | 9978.000000 | 1.100554e+06 | 854407.000000 | 992297.000000 | 31189.000000 |
| mean | 4.299158e+00 | 0.839962 | 0.767782 | 4.177126e+00 | 8.948695e-01 | 3.282257e+00 | 4.900838e+01 | 0.767782 | 0.232218 | 5.547432e+03 | 7.803261e-01 | 58.249297 | 30.317296 | 1.715227e-02 | 1.064307e+05 | 2.680832e-02 | 1.106088e-01 | 3.182942e-02 | 1.345774 | 40.577942 | 2.151789e+03 | 16.888082 | 3.239396e-01 | 67.240317 | 3.073330 | 114.830038 |
| std | 1.149444e+00 | 0.366642 | 0.317164 | 2.271524e+01 | 5.288943e+00 | 1.967482e+01 | 4.004338e+01 | 0.317164 | 0.317164 | 1.493821e+03 | 1.344017e+00 | 62.039099 | 28.813499 | 1.298387e-01 | 1.674201e+05 | 1.615229e-01 | 3.136473e-01 | 1.755458e-01 | 1.571010 | 141.789132 | 2.524573e+03 | 13.961803 | 4.679775e-01 | 55.643145 | 7.784536 | 62.313042 |
| min | 1.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 0.000000 | 0.000000 | 1.063000e+03 | 0.000000e+00 | 3.000000 | 3.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.020833 | 0.130682 | 1.000000e+00 | 1.750000 | 0.000000e+00 | 0.160000 | 0.000200 | 0.000000 |
| 25% | 4.000000e+00 | 1.000000 | 0.652174 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.500000e+01 | 0.652174 | 0.000000 | 5.626000e+03 | 0.000000e+00 | 21.000000 | 17.000000 | 0.000000e+00 | 1.630000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.333333 | 9.990000 | 5.390000e+02 | 7.000000 | 0.000000e+00 | 30.000000 | 1.000000 | 68.000000 |
| 50% | 5.000000e+00 | 1.000000 | 0.928571 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.900000e+01 | 0.928571 | 0.071429 | 6.073000e+03 | 0.000000e+00 | 40.000000 | 22.000000 | 0.000000e+00 | 4.814300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.950000 | 26.470588 | 1.337000e+03 | 18.000000 | 0.000000e+00 | 50.000000 | 1.700000 | 102.000000 |
| 75% | 5.000000e+00 | 1.000000 | 1.000000 | 3.000000e+00 | 1.000000e+00 | 3.000000e+00 | 6.200000e+01 | 1.000000 | 0.347826 | 6.234000e+03 | 1.000000e+00 | 71.000000 | 30.000000 | 0.000000e+00 | 1.234390e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.866667 | 50.000000 | 2.891000e+03 | 20.000000 | 1.000000e+00 | 95.000000 | 3.380000 | 142.000000 |
| max | 5.000000e+00 | 1.000000 | 1.000000 | 5.464000e+03 | 1.159000e+03 | 5.050000e+03 | 1.900000e+03 | 1.000000 | 1.000000 | 8.020000e+03 | 1.050000e+02 | 570.000000 | 400.000000 | 1.000000e+00 | 1.401068e+06 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 131.250000 | 110000.000000 | 2.128100e+04 | 320.000000 | 1.000000e+00 | 1000.000000 | 176.000000 | 617.000000 |
Finally, to handle potential NaN values of all kinds in the data, we apply the cleaning steps shown below.
# 1. Check for missing values
print("Missing values per column:")
print(Sephora_df.isnull().sum())
Missing values per column:
author_id 6143
rating 6143
is_recommended 174131
helpfulness 567735
total_feedback_count 6143
...
tertiary_category 161894
value_price_usd 1069365
variation_desc 1091034
variation_type 52560
variation_value 64644
Length: 66, dtype: int64
# 2. Deal with missing values
missing_counts = Sephora_df.isna().sum()
threshold = 1
columns_to_drop = missing_counts[missing_counts > threshold].index
for col in columns_to_drop:
    print(f"{col}: {missing_counts[col]} missing values")
author_id: 6143 missing values
rating: 6143 missing values
is_recommended: 174131 missing values
helpfulness: 567735 missing values
total_feedback_count: 6143 missing values
total_neg_feedback_count: 6143 missing values
total_pos_feedback_count: 6143 missing values
submission_time: 6143 missing values
review_text: 7587 missing values
review_title: 316797 missing values
skin_tone: 176682 missing values
eye_color: 215771 missing values
skin_type: 117700 missing values
hair_color: 232911 missing values
product_name: 6143 missing values
brand_name: 6143 missing values
price_usd: 6143 missing values
pos_feedback_percentage: 567735 missing values
neg_feedback_percentage: 567735 missing values
child_max_price: 644796 missing values
child_min_price: 644796 missing values
highlights: 115729 missing values
ingredients: 22843 missing values
price_per_ml: 246147 missing values
price_per_oz: 108257 missing values
reviews: 278 missing values
sale_price_usd: 1090576 missing values
secondary_category: 8 missing values
size: 44661 missing values
size_ml: 246147 missing values
size_oz: 108257 missing values
tertiary_category: 161894 missing values
value_price_usd: 1069365 missing values
variation_desc: 1091034 missing values
variation_type: 52560 missing values
variation_value: 64644 missing values
Since product_name is the unique identifier without which analysis becomes meaningless, we first drop the rows where product_name is missing.
# Drop rows with no product information (NaN in column 'product_name')
Sephora_df = Sephora_df.dropna(subset=['product_name'])
After that, we fill the remaining NaN values according to column-specific rules, detailed in the code blocks below.
# 1. Handle feedback-related columns
Sephora_df = Sephora_df.assign(
helpfulness=Sephora_df['helpfulness'].fillna(0),
pos_feedback_percentage=Sephora_df['pos_feedback_percentage'].fillna(1),
neg_feedback_percentage=Sephora_df['neg_feedback_percentage'].fillna(0),
is_recommended=Sephora_df['is_recommended'].fillna(1)
)
# 2. Handle price-related columns by filling with price_usd
Sephora_df = Sephora_df.assign(
child_max_price=Sephora_df['child_max_price'].fillna(Sephora_df['price_usd']),
child_min_price=Sephora_df['child_min_price'].fillna(Sephora_df['price_usd']),
sale_price_usd=Sephora_df['sale_price_usd'].fillna(Sephora_df['price_usd']),
value_price_usd=Sephora_df['value_price_usd'].fillna(Sephora_df['price_usd'])
)
# 3. Handle variation-related columns
Sephora_df = Sephora_df.assign(
variation_desc=Sephora_df['variation_desc'].fillna('No variation'),
variation_type=Sephora_df['variation_type'].fillna('Unknown'),
variation_value=Sephora_df['variation_value'].fillna('Unknown')
)
# 4. Handle product and review metadata
Sephora_df = Sephora_df.assign(
review_text=Sephora_df['review_text'].fillna('No review provided'),
review_title=Sephora_df['review_title'].fillna('No title'),
tertiary_category=Sephora_df['tertiary_category'].fillna('Uncategorized')
)
# 5. Handle skin, hair, and eye attributes
Sephora_df = Sephora_df.assign(
skin_tone=Sephora_df['skin_tone'].fillna('Not specified'),
eye_color=Sephora_df['eye_color'].fillna('Not specified'),
skin_type=Sephora_df['skin_type'].fillna('Not specified'),
hair_color=Sephora_df['hair_color'].fillna('Not specified')
)
# 6. Handle ingredient and highlight columns
Sephora_df = Sephora_df.assign(
ingredients=Sephora_df['ingredients'].fillna('Ingredients not listed'),
highlights=Sephora_df['highlights'].fillna('No highlights available')
)
# 7. Handle size-related columns
Sephora_df = Sephora_df.assign(
size=Sephora_df['size'].fillna('Unknown size'),
size_ml=Sephora_df['size_ml'].fillna(Sephora_df['size_ml'].median()),
size_oz=Sephora_df['size_oz'].fillna(Sephora_df['size_oz'].median())
)
# 8. Handle price per unit columns
Sephora_df = Sephora_df.assign(
price_per_ml=Sephora_df['price_per_ml'].fillna(Sephora_df['price_per_ml'].median()),
price_per_oz=Sephora_df['price_per_oz'].fillna(Sephora_df['price_per_oz'].median())
)
# 9. Print the result to confirm the missing values have been handled
missing_counts_after = Sephora_df.isna().sum()
print("Missing values after processing:")
print(missing_counts_after[missing_counts_after > 0])
Missing values after processing:
Series([], dtype: int64)
numerical_cols = Sephora_df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = Sephora_df.select_dtypes(include=['object']).columns
# 3. Strip whitespace from string columns (if any)
Sephora_df[categorical_cols] = Sephora_df[categorical_cols].apply(lambda x: x.str.strip())
In Project Part 2, we will focus on applying statistical and graphical analyses to extract insights and address the research questions we posed.
For review, these are our research questions:
For research problem 1, our primary interest lies in understanding how product-related factors influence a product's average rating, so we will mainly work with product_df. The combined dataset will be used in research problems 2, 3, and 4.
In this part, we supplement the analysis with some basic exploration of product_df.
product_df.columns
Index(['product_id', 'product_name', 'brand_id', 'brand_name', 'loves_count',
'rating', 'reviews', 'size', 'variation_type', 'variation_value',
'variation_desc', 'ingredients', 'price_usd', 'value_price_usd',
'sale_price_usd', 'limited_edition', 'new', 'online_only',
'out_of_stock', 'sephora_exclusive', 'highlights', 'primary_category',
'secondary_category', 'tertiary_category', 'child_count',
'child_max_price', 'child_min_price', 'size_oz', 'size_ml',
'price_per_oz', 'price_per_ml', 'contains_Glycerin',
'contains_Phenoxyethanol', 'contains_Caprylyl Glycol',
'contains_Tocopherol', 'contains_Limonene',
'contains_Ethylhexylglycerin', 'contains_Dimethicone',
'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol',
'contains_Citric Acid', 'contains_Mica', 'contains_Potassium Sorbate',
'contains_Water', 'contains_Caprylic/Capric Triglyceride',
'contains_Sodium Benzoate', 'contains_Citronellol',
'contains_Tocopheryl Acetate', 'contains_Propanediol',
'contains_Geraniol'],
dtype='object')
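The construction of product_df is not shown in this excerpt. As a hedged sketch (assuming it is derived from the review-level Sephora_df, and using a tiny invented table in place of the real data), one row per product can be obtained by grouping on product_id, keeping product-level attributes from the first occurrence, and averaging the review ratings:

```python
import pandas as pd

# Hypothetical miniature of the review-level table (Sephora_df);
# column names follow the dataset, values are illustrative only.
Sephora_df = pd.DataFrame({
    'product_id':   ['P1', 'P1', 'P2'],
    'product_name': ['Serum A', 'Serum A', 'Cream B'],
    'rating':       [5.0, 4.0, 3.0],
    'price_usd':    [35.0, 35.0, 20.0],
})

# One row per product: product-level attributes from the first
# occurrence, average rating recomputed across that product's reviews.
product_df = (
    Sephora_df
    .groupby('product_id', as_index=False)
    .agg(product_name=('product_name', 'first'),
         rating=('rating', 'mean'),
         price_usd=('price_usd', 'first'))
)
print(product_df)
```

The ingredient indicator columns (contains_*) would then be added on top of this product-level frame by parsing the ingredients field.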
To analyze the relationship between product average ratings and related numerical variables (such as price_usd, price_per_oz, size_oz, loves_count, and reviews), we calculate the correlations between rating and these features in the product_df DataFrame. This helps identify which factors are most strongly associated with ratings.
correlation_features = ['rating', 'price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews']
correlation_data = product_df[correlation_features]
corr_matrix = correlation_data.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
# Relationship between rating and top 10 ingredients
correlation_features = ['rating', 'contains_Glycerin',
'contains_Phenoxyethanol', 'contains_Caprylyl Glycol',
'contains_Tocopherol', 'contains_Limonene',
'contains_Ethylhexylglycerin', 'contains_Dimethicone',
'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol']
correlation_data = product_df[correlation_features]
corr_matrix = correlation_data.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
From the heatmaps above, none of loves_count, reviews, or the ingredient indicators shows a strong linear correlation with rating.
However, we will analyze the numerical variables in more detail in the next several sections, and examine the categorical variables through comparative and multivariate analysis.
# Use ChatGPT to generate a summary of the dataset
# Descriptive statistics for the 'rating' column and other relevant columns in product_df
rating_data = product_df['rating'] # Replace 'rating' with actual column name if different
# Calculating descriptive statistics
mean_rating = rating_data.mean()
median_rating = rating_data.median()
mode_rating = rating_data.mode().iloc[0] # Mode may return multiple values; take the first mode
range_rating = rating_data.max() - rating_data.min()
variance_rating = rating_data.var()
std_dev_rating = rating_data.std()
# Print results
print("Descriptive Statistics for Ratings:")
print(f"Mean: {mean_rating}")
print(f"Median: {median_rating}")
print(f"Mode: {mode_rating}")
print(f"Range: {range_rating}")
print(f"Variance: {variance_rating}")
print(f"Standard Deviation: {std_dev_rating}")
Descriptive Statistics for Ratings:
Mean: 4.194512889483933
Median: 4.289350000000001
Mode: 5.0
Range: 4.0
Variance: 0.26697307784114505
Standard Deviation: 0.5166943756623881
Conclusions:
Mean and Median: The mean (4.19) sits slightly below the median (4.29), indicating a mild left skew — a minority of low ratings pulls the average down.
Mode: The most frequent rating is 5.0, so perfect scores dominate.
Range: Ratings span the full scale (range = 4.0, i.e., from 1 to 5).
Variance and Standard Deviation: The variance (0.267) and standard deviation (0.517) are small relative to the 1-5 scale, showing that ratings cluster tightly near the top.
# Code generated by ChatGPT
# If you want to calculate these metrics for additional columns (e.g., 'price', 'review_count', etc.), you can loop through them:
relevant_columns = ['price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews'] # Replace with actual column names
# Loop through relevant columns and print statistics
for column in relevant_columns:
    print(f"\nDescriptive Statistics for {column.capitalize()}:")
    mean = product_df[column].mean()
    median = product_df[column].median()
    mode = product_df[column].mode().iloc[0]  # Take the first mode if multiple
    value_range = product_df[column].max() - product_df[column].min()
    variance = product_df[column].var()
    std_dev = product_df[column].std()
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print(f"Range: {value_range}")
    print(f"Variance: {variance}")
    print(f"Standard Deviation: {std_dev}")
Descriptive Statistics for Price_usd:
Mean: 51.65559453732046
Median: 35.0
Mode: 30.0
Range: 1897.0
Variance: 2880.386721882422
Standard Deviation: 53.669234407455654

Descriptive Statistics for Price_per_oz:
Mean: 160.71142475513528
Median: 32.35294117647059
Mode: 10.0
Range: 109999.86931818182
Variance: 2595230.6017293455
Standard Deviation: 1610.971943185028

Descriptive Statistics for Size_oz:
Mean: 3.6344106906629747
Median: 1.7
Mode: 1.7
Range: 175.9998
Variance: 45.44301872925057
Standard Deviation: 6.741143725604029

Descriptive Statistics for Loves_count:
Mean: 29179.56592889098
Median: 9880.0
Mode: 0
Range: 1401068
Variance: 4368168668.437603
Standard Deviation: 66092.12258989419

Descriptive Statistics for Reviews:
Mean: 448.54552093476144
Median: 122.0
Mode: 2.0
Range: 21280.0
Variance: 1214365.4939714111
Standard Deviation: 1101.9825288866475
Conclusions and why they are considered related factors:
Price_usd: The mean ($51.66) sits well above the median ($35.00) with a very wide range, so prices are right-skewed; price plausibly shapes customer expectations and thus ratings.
Price_per_oz: Extremely dispersed (SD ≈ 1,611 vs. a median of ≈ 32), driven by small luxury sizes; unit price captures perceived value for money.
Size_oz: Median size is 1.7 oz with a few very large outliers; size relates to product format and usage expectations.
Loves_count: Heavily skewed (mean ≈ 29,180 vs. median 9,880), so popularity concentrates in a few hit products and may track satisfaction.
Reviews: Also skewed (mean ≈ 449 vs. median 122); review volume indicates engagement and affects how stable a product's average rating is.
This is for testing correlation with continuous variables like price_usd.
Steps:
Define Hypotheses: H0: there is no linear relationship between rating and the predictor (e.g., price_usd); H1: there is a significant relationship.
Conduct Regression: fit a linear regression with rating as the dependent variable and price_usd as the independent variable. To test multiple factors at once (price_usd, price_per_oz, etc.), use multiple linear regression.
Choose Significance Level: we use the conventional 0.05.
Interpret Results: if the p-value for price_usd (or another factor) is below 0.05, reject the null hypothesis, suggesting a significant relationship between rating and that factor.
import statsmodels.api as sm
# GPT facilitated the generation of the following code
# Define the independent and dependent variables
X = product_df[['price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews']]
X = sm.add_constant(X) # Adds a constant term to the model
y = product_df['rating']
# Drop rows with NaN or Inf values
X = X.replace([float('inf'), float('-inf')], float('nan')).dropna()
y = y.loc[X.index] # Ensure y matches the filtered X index
# Fit the regression model
model = sm.OLS(y, X).fit()
# Print the summary of regression results
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: rating R-squared: 0.016
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 20.15
Date: Wed, 06 Nov 2024 Prob (F-statistic): 5.28e-20
Time: 12:39:20 Log-Likelihood: -4163.9
No. Observations: 6341 AIC: 8340.
Df Residuals: 6335 BIC: 8380.
Df Model: 5
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 4.1332 0.010 408.638 0.000 4.113 4.153
price_usd 0.0010 0.000 7.844 0.000 0.001 0.001
price_per_oz -5.617e-06 3.6e-06 -1.560 0.119 -1.27e-05 1.44e-06
size_oz 0.0024 0.001 2.658 0.008 0.001 0.004
loves_count -1.248e-08 1.11e-07 -0.112 0.911 -2.31e-07 2.06e-07
reviews 3.13e-05 6.74e-06 4.644 0.000 1.81e-05 4.45e-05
==============================================================================
Omnibus: 1746.116 Durbin-Watson: 1.609
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6024.500
Skew: -1.366 Prob(JB): 0.00
Kurtosis: 6.917 Cond. No. 1.36e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Regression Results Analysis:
Model Fit: R-squared is 0.016, so the predictors explain only about 1.6% of the variance in ratings.
Model Significance: the F-statistic (20.15, p = 5.28e-20) shows the model as a whole is statistically significant despite the weak fit.
Coefficients:
price_usd (p = 0.000): small positive effect on ratings.
size_oz (p = 0.008): slight positive association with ratings.
reviews (p = 0.000): more reviews linked to slightly higher ratings.
price_per_oz and loves_count: no significant effect.
Multicollinearity: the large condition number (1.36e+05) suggests multicollinearity or scaling problems among the predictors.
Summary:
While price_usd, size_oz, and reviews show a weak positive relationship with ratings, the low R-squared indicates these factors alone don't strongly predict ratings. Further exploration with additional features is recommended.
This is for testing differences in ratings between categories, like product type.
Steps:
Define Hypotheses: H0: mean rating is the same across categories; H1: at least one category's mean rating differs.
Conduct ANOVA: product_type (or any other categorical variable) is relevant here; ANOVA tests whether the means of rating vary significantly across product types.
Choose Significance Level: we use the conventional 0.05.
Interpret Results: if the p-value is below 0.05, conclude that average ratings differ significantly across the categories.
pd.set_option('display.max_columns', None)
product_df.head()
| product_id | product_name | brand_id | brand_name | loves_count | rating | reviews | size | variation_type | variation_value | variation_desc | ingredients | price_usd | value_price_usd | sale_price_usd | limited_edition | new | online_only | out_of_stock | sephora_exclusive | highlights | primary_category | secondary_category | tertiary_category | child_count | child_max_price | child_min_price | size_oz | size_ml | price_per_oz | price_per_ml | contains_Glycerin | contains_Phenoxyethanol | contains_Caprylyl Glycol | contains_Tocopherol | contains_Limonene | contains_Ethylhexylglycerin | contains_Dimethicone | contains_Linalool | contains_Silica | contains_Butylene Glycol | contains_Citric Acid | contains_Mica | contains_Potassium Sorbate | contains_Water | contains_Caprylic/Capric Triglyceride | contains_Sodium Benzoate | contains_Citronellol | contains_Tocopheryl Acetate | contains_Propanediol | contains_Geraniol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P473671 | Fragrance Discovery Set | 6342 | 19-69 | 6320 | 3.6364 | 11.0 | NaN | NaN | NaN | NaN | [Capri Eau de Parfum:, Alcohol Denat. (SD Alco... | 35.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... | Fragrance | Value & Gift Sets | Perfume Gift Sets | 0 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True |
| 1 | P473668 | La Habana Eau de Parfum | 6342 | 19-69 | 3827 | 4.1538 | 13.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 85.0 | 30.0 | 3.4 | 100.0 | 57.352941 | 1.95 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | P473662 | Rainbow Bar Eau de Parfum | 6342 | 19-69 | 3253 | 4.2500 | 16.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 | 3.4 | 100.0 | 57.352941 | 1.95 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | P473660 | Kasbah Eau de Parfum | 6342 | 19-69 | 3018 | 4.4762 | 21.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 | 3.4 | 100.0 | 57.352941 | 1.95 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True |
| 4 | P473658 | Purple Haze Eau de Parfum | 6342 | 19-69 | 2691 | 3.2308 | 13.0 | 3.4 oz/ 100 mL | Size + Concentration + Formulation | 3.4 oz/ 100 mL | NaN | [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... | 195.0 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | ['Unisex/ Genderless Scent', 'Layerable Scent'... | Fragrance | Women | Perfume | 2 | 75.0 | 30.0 | 3.4 | 100.0 | 57.352941 | 1.95 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True |
from scipy import stats
# GPT facilitated the generation of the following code
# Check if 'new' and 'rating' columns exist and handle missing values in 'rating'
if 'new' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['new'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['new'] == 0]['rating'].dropna()
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'new' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=86.50109826723633, pvalue=1.7601462104023535e-20)
Conclusion:
There is a significant difference (pvalue=1.760e-20) in average product ratings between new and non-new products.
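A side note on the method: with exactly two groups, one-way ANOVA is equivalent to a pooled-variance two-sample t-test (F = t²). The sketch below checks that identity on synthetic rating samples (all numbers invented, standing in for the new/non-new groups):

```python
import numpy as np
from scipy import stats

# Invented rating samples standing in for the 'new' vs non-'new' groups.
rng = np.random.default_rng(0)
group_1 = rng.normal(4.10, 0.5, 300)   # hypothetical "new" products
group_0 = rng.normal(4.25, 0.5, 3000)  # hypothetical "non-new" products

f_res = stats.f_oneway(group_1, group_0)
t_res = stats.ttest_ind(group_1, group_0)  # pooled-variance t-test

# With two groups, ANOVA's F equals the t-statistic squared,
# and the two p-values coincide.
print(f_res.statistic, t_res.statistic ** 2)
```

So the two-group ANOVA results below can equally be read as t-tests.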
# GPT facilitated the generation of the following code
# Check if 'online_only' and 'rating' columns exist and handle missing values in 'rating'
if 'online_only' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['online_only'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['online_only'] == 0]['rating'].dropna()
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'online_only' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=3.9889151861260834, pvalue=0.045833473720411584)
Conclusion:
There is a significant difference (pvalue=0.0458) in average product ratings between online_only and non-online_only products.
# GPT facilitated the generation of the following code
# Check if 'limited_edition' and 'rating' columns exist and handle missing values in 'rating'
if 'limited_edition' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['limited_edition'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['limited_edition'] == 0]['rating'].dropna()
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'limited_edition' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=21.401836958075258, pvalue=3.781165073111685e-06)
Conclusion:
There is a significant difference (pvalue=3.7811e-06) in average product ratings between limited_edition and non-limited_edition products.
# Code generated by ChatGPT
# Automatically find all columns that start with "contains_"
columns = [col for col in product_df.columns if col.startswith("contains_")]
# Dictionary to store ANOVA results
anova_results = {}
# Loop through each ingredient column and perform ANOVA
for col in columns:
    # Check if 'rating' exists in product_df
    if 'rating' in product_df.columns:
        # Remove NaN values from 'rating' for each group
        group_1 = product_df[product_df[col] == 1]['rating'].dropna()
        group_0 = product_df[product_df[col] == 0]['rating'].dropna()
        # Ensure both groups have data before running ANOVA
        if len(group_1) > 0 and len(group_0) > 0:
            # Conduct ANOVA
            anova_result = stats.f_oneway(group_1, group_0)
            anova_results[col] = anova_result.pvalue  # Store p-value
        else:
            print(f"Insufficient data for {col} (one of the groups has no data).")
    else:
        print("Error: 'rating' column not found in product_df.")
        break  # Exit the loop if 'rating' column is missing
# Display only significant results
significant_results = {col: p for col, p in anova_results.items() if p < 0.05}
print("\nSignificant ANOVA Results (p < 0.05):")
for col, p in significant_results.items():
    print(f"{col}: p-value = {p}")
Significant ANOVA Results (p < 0.05):
contains_Phenoxyethanol: p-value = 1.3530427418941688e-06
contains_Tocopherol: p-value = 0.005581632113425609
contains_Limonene: p-value = 9.816066072505535e-13
contains_Linalool: p-value = 4.1893522770234076e-09
contains_Silica: p-value = 0.0003927895758699168
contains_Butylene Glycol: p-value = 0.004688569041837416
contains_Water: p-value = 7.234178076608262e-05
contains_Caprylic/Capric Triglyceride: p-value = 0.005249331521078461
contains_Citronellol: p-value = 1.0400385863489007e-08
contains_Tocopheryl Acetate: p-value = 0.003148642697125757
contains_Geraniol: p-value = 1.3540721919634177e-08
Conclusion:
There is a significant difference in average product ratings between products with and without the following ingredients:
Boolean features: 'limited_edition', 'new', 'online_only', 'contains_Phenoxyethanol', 'contains_Tocopherol', 'contains_Limonene', 'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol', 'contains_Water', 'contains_Caprylic/Capric Triglyceride', 'contains_Citronellol', 'contains_Tocopheryl Acetate', 'contains_Geraniol'
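Since many hypothesis tests are run at p < 0.05, some "significant" results could arise by chance. A Holm-Bonferroni correction is one standard guard; the helper below is our own minimal, dependency-free sketch (statsmodels' `multipletests` with `method='holm'` is an alternative), applied to the ingredient p-values printed above:

```python
def holm_significant(pvals, alpha=0.05):
    """Return the names whose p-values survive Holm-Bonferroni."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])  # ascending p
    m = len(items)
    kept = []
    for i, (name, p) in enumerate(items):
        if p <= alpha / (m - i):
            kept.append(name)
        else:
            break  # Holm stops at the first non-rejection
    return kept

# p-values reported by the ANOVA loop above
pvalues = {
    'contains_Phenoxyethanol': 1.3530427418941688e-06,
    'contains_Tocopherol': 0.005581632113425609,
    'contains_Limonene': 9.816066072505535e-13,
    'contains_Linalool': 4.1893522770234076e-09,
    'contains_Silica': 0.0003927895758699168,
    'contains_Butylene Glycol': 0.004688569041837416,
    'contains_Water': 7.234178076608262e-05,
    'contains_Caprylic/Capric Triglyceride': 0.005249331521078461,
    'contains_Citronellol': 1.0400385863489007e-08,
    'contains_Tocopheryl Acetate': 0.003148642697125757,
    'contains_Geraniol': 1.3540721919634177e-08,
}
survivors = holm_significant(pvalues)
print(len(survivors), "of", len(pvalues), "remain significant")
# → 11 of 11 remain significant
```

All eleven ingredient effects survive the correction, so the conclusion above is not an artifact of multiple testing.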
import seaborn as sns
import matplotlib.pyplot as plt
# Code generated by ChatGPT
# List of boolean features to plot
boolean_features = [
'limited_edition', 'new', 'online_only', 'contains_Phenoxyethanol',
'contains_Tocopherol', 'contains_Limonene', 'contains_Linalool',
'contains_Silica', 'contains_Butylene Glycol', 'contains_Water',
'contains_Caprylic/Capric Triglyceride', 'contains_Citronellol',
'contains_Tocopheryl Acetate', 'contains_Geraniol'
]
# Set up the plot style and size
sns.set(style="whitegrid")
num_features = len(boolean_features)
num_cols = 2 # Number of columns for the plots
num_rows = (num_features + 1) // num_cols # Calculate required rows
plt.figure(figsize=(15, num_rows * 5)) # Adjust figure size
# Loop through each boolean feature and create a violin plot
for i, feature in enumerate(boolean_features, 1):
    plt.subplot(num_rows, num_cols, i)
    sns.violinplot(x=feature, y='rating', data=product_df, hue=feature, palette="muted", legend=False)
    plt.title(f"Rating Distribution by {feature}")
    plt.xlabel(feature)
    plt.ylabel("Rating")
# Adjust layout for better display
plt.tight_layout()
plt.show()
To enhance clarity and organization, we use a hexbin plot instead of a scatter plot.
# Hexbin Plot (Rating vs Price)
# Code generated by ChatGPT
# Filter products with price <= 350 USD
filtered_df = product_df[product_df['price_usd'] <= 350]
plt.figure(figsize=(8, 6))
plt.hexbin(filtered_df['price_usd'], filtered_df['rating'], gridsize=30,
cmap='viridis', mincnt=1)
plt.colorbar(label='Number of Products')
plt.xlabel('Price (USD)')
plt.ylabel('Rating')
plt.title('Hexbin Plot of Rating vs Price (<= 350 USD)')
plt.show()
# Hexbin Plot (Rating vs Unit Price)
# Filter products with unit price <= 600 USD
filtered_df = product_df[product_df['price_per_oz'] <= 600]
plt.figure(figsize=(8, 6))
plt.hexbin(filtered_df['price_per_oz'], filtered_df['rating'], gridsize=30,
cmap='viridis', mincnt=1)
plt.colorbar(label='Number of Products')
plt.xlabel('Price per oz (USD)')
plt.ylabel('Rating')
plt.title('Hexbin Plot of Rating vs Unit Price (<= 600 USD)')
plt.show()
# Code generated by ChatGPT with my prompt
bins = [0, 10, 20, 50, 100, 200, 500, product_df['price_usd'].max()]
labels = ['0-10', '10-20', '20-50', '50-100', '100-200', '200-500', '500+']
product_df['price_bin'] = pd.cut(product_df['price_usd'], bins=bins, labels=labels, include_lowest=True)
avg_rating_by_bin = product_df.groupby('price_bin', observed=False)['rating'].mean().reset_index()
plt.figure(figsize=(6, 4))
plt.bar(avg_rating_by_bin['price_bin'], avg_rating_by_bin['rating'])
plt.title('Average Rating by Price Bin', fontsize=16)
plt.xlabel('Price Range (USD)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.show()
# Code generated by ChatGPT with my prompt
bins = [0, 50, 100, 200, 500, product_df['price_per_oz'].max()]
labels = ['0-50', '50-100', '100-200', '200-500', '500+']
product_df['price_bin'] = pd.cut(product_df['price_per_oz'], bins=bins, labels=labels, include_lowest=True)
avg_rating_by_bin = product_df.groupby('price_bin', observed=False)['rating'].mean().reset_index()
plt.figure(figsize=(6, 4))
plt.bar(avg_rating_by_bin['price_bin'], avg_rating_by_bin['rating'])
plt.title('Average Rating by Unit Price Bin', fontsize=16)
plt.xlabel('Price per oz Range (USD)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.show()
Analysis:
From the figures above, average rating varies slightly across total-price bins, while unit price shows no noticeable pattern.
Additionally, if the goal is higher ratings, pricing within roughly the 20-200 USD range appears most favorable.
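The bar charts can be backed by a formal test: a one-way ANOVA across price bands checks whether mean ratings actually differ by price range rather than eyeballing the bars. The sketch below uses synthetic data in place of product_df (columns named as in the dataset, values invented):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for product_df; real price_usd/rating would be used.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'price_usd': rng.uniform(1, 500, 600),
    'rating': rng.normal(4.2, 0.5, 600),
})
# Three illustrative price bands around the 20-200 USD range of interest
demo['price_bin'] = pd.cut(demo['price_usd'],
                           bins=[0, 20, 200, demo['price_usd'].max()],
                           labels=['0-20', '20-200', '200+'],
                           include_lowest=True)

# One-way ANOVA: do mean ratings differ across the price bands?
groups = [g['rating'].to_numpy()
          for _, g in demo.groupby('price_bin', observed=True)]
res = stats.f_oneway(*groups)
print(res.pvalue)
```

On the real data, a small p-value here would substantiate the claim that the 20-200 USD band rates differently from the others.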
# Primary Category vs Average Rating
product_df['rating'] = pd.to_numeric(product_df['rating'], errors='coerce')
avg_rating_by_category = product_df.groupby('primary_category')['rating'].mean().\
reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['primary_category', 'average_rating']
print(avg_rating_by_category)
  primary_category  average_rating
2  Gifts             4.563450
5  Men               4.504992
8  Tools & Brushes   4.271458
1  Fragrance         4.230889
7  Skincare          4.228890
3  Hair              4.201113
0  Bath & Body       4.195015
4  Makeup            4.146845
6  Mini Size         4.005665
# Secondary Category vs Average Rating
avg_rating_by_category = product_df.groupby('secondary_category')['rating'].mean().\
reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['secondary_category', 'average_rating']
print(avg_rating_by_category)
    secondary_category  average_rating
29  Other Needs                4.682980
32  Shaving                    4.641243
25  Men                        4.548450
23  Makeup Palettes            4.513805
18  Hair Tools                 4.496691
4   Beauty Supplements         4.421550
1   Bath & Body                4.377929
34  Skincare                   4.353302
10  Cheek                      4.323424
3   Beauty Accessories         4.322500
11  Cleansers                  4.314373
37  Treatments                 4.306111
8   Brushes & Applicators      4.305875
27  Moisturizers               4.277424
24  Masks                      4.259330
17  Hair Styling & Treatments  4.244675
7   Body Moisturizers          4.237487
40  Women                      4.233255
31  Shampoo & Conditioner      4.214010
20  Lip                        4.209429
2   Bath & Shower              4.196601
6   Body Care                  4.164788
21  Lip Balms & Treatments     4.149557
38  Value & Gift Sets          4.144749
19  High Tech Tools            4.142364
14  Face                       4.139327
35  Sunscreen                  4.125169
39  Wellness                   4.123166
13  Eye Care                   4.111931
36  Tools                      4.066703
26  Mini Size                  4.066505
0   Accessories                4.052295
12  Eye                        4.050572
5   Beauty Tools               4.033523
9   Candles & Home Scents      4.021335
30  Self Tanners               3.944451
22  Makeup                     3.926427
16  Hair                       3.898375
33  Shop by Concern            3.895000
28  Nail                       3.878163
15  Fragrance                  3.865240
# Tertiary Category vs Average Rating
avg_rating_by_category = product_df.groupby('tertiary_category')['rating'].mean().\
reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['tertiary_category', 'average_rating']
print(avg_rating_by_category)
     tertiary_category  average_rating
82   Lip Brushes                 4.711833
108  Shaving                     4.709130
91   Makeup Bags & Travel Cases  4.690660
12   Body Products               4.682980
54   Face Wash                   4.624411
..   ...                         ...
32   Deodorant & Antiperspirant  3.764542
5    Bath Soaks & Bubble Bath    3.705443
30   Damaged Hair                3.696500
3    BB & CC Cream               3.669130
93   Manicure & Pedicure Tools   2.285700
[118 rows x 2 columns]
Analysis:
From the results, we can observe the following trends:
Primary Category: Gifts (4.56) and Men (4.50) earn the highest average ratings, while Mini Size (4.01) and Makeup (4.15) sit at the bottom.
Secondary Category: Other Needs (4.68), Shaving (4.64), and Men (4.55) lead the ranking, whereas Fragrance (3.87), Nail (3.88), and Shop by Concern (3.90) trail.
Tertiary Category: Lip Brushes (4.71) and Shaving (4.71) rate highest, while Manicure & Pedicure Tools (2.29) is a clear outlier at the bottom.
These results highlight areas where certain products excel and others fall short, suggesting that focusing on customer feedback and product refinement in the lower-performing categories could enhance overall satisfaction.
Note: Writing assisted by GPT.
# brand_name vs Average Rating
avg_rating_by_category = product_df.groupby('brand_name')['rating'].mean().\
reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['brand_name', 'average_rating']
print(avg_rating_by_category)
           brand_name  average_rating
78        Erno Laszlo        5.000000
10              Aquis        4.904800
158   MACRENE actives        4.889420
161              MARA        4.823860
32             CANOPY        4.813733
..                ...             ...
47   Christophe Robin        3.154767
101    Good Dye Young        3.062050
194           Overose        2.848450
253         The Maker             NaN
299        philosophy             NaN

[304 rows x 2 columns]
low_perform_brand = avg_rating_by_category['brand_name'].tail(35)
Analysis:
Top Brands: Erno Laszlo (5.00), Aquis (4.90), and MACRENE actives (4.89) lead the ranking, though a perfect score can reflect a small number of rated products.
Lower-Rated Brands: Christophe Robin (3.15) and Good Dye Young (3.06) fall well below the overall average; The Maker and philosophy have no rating data (NaN).
Lowest-Rated Brand: Overose (2.85) has the lowest non-missing average rating.
High-performing brands should maintain their quality, while underperforming brands need to address product and customer experience issues.
Note: Writing assisted by GPT.
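A caveat for the brand ranking: a perfect 5.00 average can come from a single rated product. Below is a minimal sketch, using a hypothetical toy table standing in for product_df, of aggregating the product count alongside the mean so brands with too few rated products can be filtered out before ranking.

```python
import pandas as pd

# Toy stand-in for product_df (hypothetical data, for illustration only).
df = pd.DataFrame({
    'brand_name': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating':     [4.0, 4.5, 5.0, 5.0, 3.0, 3.5],
})

# Aggregate the mean rating together with the number of rated products
# per brand, then keep only brands with at least 2 rated products.
brand_stats = (df.groupby('brand_name')['rating']
                 .agg(average_rating='mean', n_products='count')
                 .reset_index())
reliable = brand_stats[brand_stats['n_products'] >= 2]
print(reliable)
```

Brand B (a single 5.0 rating) is dropped, so it cannot top the ranking on one data point; the minimum-count threshold of 2 is an arbitrary choice for the sketch.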
# Online vs Average Rating
avg_rating_by_category = product_df.groupby('online_only')['rating'].mean().\
reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['online_only', 'average_rating']
print(avg_rating_by_category)
   online_only  average_rating
0            0        4.200374
1            1        4.172408
Products that are not online-only (4.20) have slightly higher average ratings than online-only products (4.17), though the gap is small.
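Whether a 0.03-point gap is meaningful can be checked with a significance test. Below is a hedged sketch using Welch's t-test (which does not assume equal group variances) on a hypothetical mini-dataset; the real product_df groups would be substituted in.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical mini-dataset standing in for product_df.
df = pd.DataFrame({
    'online_only': [0, 0, 0, 0, 1, 1, 1, 1],
    'rating':      [4.5, 4.2, 4.3, 4.4, 4.1, 4.0, 4.2, 4.1],
})

offline = df.loc[df['online_only'] == 0, 'rating']
online = df.loc[df['online_only'] == 1, 'rating']

# Welch's t-test (equal_var=False) compares the two group means
# without assuming the groups share a common variance.
t_stat, p_value = ttest_ind(offline, online, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A small p-value would indicate the offline/online rating gap is unlikely to be chance; on the toy data here the offline group has the higher mean, so the t-statistic is positive.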
# Code generated by ChatGPT with my prompt
# Calculate average rating by both primary category and online_only status
avg_rating_online_category = product_df.groupby(['primary_category', 'online_only'])['rating'].mean().reset_index()
# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='primary_category', columns='online_only', values='rating')
pivot_table.columns = ['Non-Online', 'Online']
# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print(pivot_table)
                  Non-Online    Online  Rating_Difference
primary_category
Mini Size           3.998237  4.058100           0.059863
Skincare            4.218118  4.271438           0.053320
Tools & Brushes     4.271543  4.271013          -0.000530
Hair                4.205973  4.191046          -0.014927
Bath & Body         4.222076  4.151288          -0.070788
Fragrance           4.259239  4.144556          -0.114683
Men                 4.542989  4.382857          -0.160132
Makeup              4.164505  3.998730          -0.165774
Gifts               4.563450       NaN                NaN
Analysis:
Primary Category Comparison:
Higher Online Ratings: Mini Size (+0.06) and Skincare (+0.05) rate slightly better online.
Minimal Difference: Tools & Brushes and Hair show essentially no gap.
Higher Non-Online Ratings: Makeup (-0.17), Men (-0.16), and Fragrance (-0.11) rate noticeably better offline.
Conclusion: Categories that benefit from in-person testing (scent, color, texture) tend to rate better offline, while routine skincare items fare as well or better online.
Note: Writing assisted by GPT.
# Calculate average rating by both secondary_category and online_only status
avg_rating_online_category = product_df.groupby(['secondary_category', 'online_only'])['rating'].mean().reset_index()
# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='secondary_category', columns='online_only', values='rating')
pivot_table.columns = ['Non-Online', 'Online']
# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print(pivot_table)
                           Non-Online    Online  Rating_Difference
secondary_category
Beauty Tools                 3.993805  4.867600           0.873795
Bath & Body                  4.185900  4.633967           0.448067
Fragrance                    3.772564  4.120100           0.347536
Sunscreen                    4.091897  4.282774           0.190877
Body Care                    4.059471  4.240400           0.180929
Wellness                     4.081017  4.214208           0.133191
Moisturizers                 4.252550  4.374208           0.121658
Masks                        4.243141  4.364555           0.121413
Self Tanners                 3.920877  4.019887           0.099010
Cleansers                    4.300014  4.392433           0.092419
Value & Gift Sets            4.118418  4.204760           0.086341
Eye Care                     4.103188  4.155081           0.051892
Tools                        4.050158  4.087829           0.037671
Skincare                     4.347367  4.384560           0.037193
Hair Styling & Treatments    4.236636  4.266542           0.029906
Other Needs                  4.671633  4.700000           0.028367
Lip Balms & Treatments       4.144431  4.168485           0.024053
Mini Size                    4.066819  4.065684          -0.001135
Hair                         3.898549  3.897257          -0.001292
Treatments                   4.307763  4.298656          -0.009107
High Tech Tools              4.147954  4.137593          -0.010362
Cheek                        4.327744  4.295345          -0.032399
Shampoo & Conditioner        4.226891  4.184653          -0.042237
Women                        4.252738  4.174715          -0.078023
Face                         4.147996  4.067131          -0.080864
Hair Tools                   4.512100  4.427350          -0.084750
Body Moisturizers            4.277604  4.162533          -0.115071
Eye                          4.062154  3.898110          -0.164044
Candles & Home Scents        4.079350  3.907596          -0.171754
Nail                         3.946375  3.769025          -0.177350
Makeup                       3.940203  3.754230          -0.185973
Makeup Palettes              4.561207  4.371600          -0.189607
Bath & Shower                4.272463  4.067637          -0.204826
Lip                          4.236904  4.008690          -0.228214
Shaving                      4.677833  4.421700          -0.256133
Men                          4.576584  4.290915          -0.285669
Accessories                  4.124686  3.726538          -0.398149
Beauty Accessories           4.700000  4.228125          -0.471875
Brushes & Applicators        4.347053  3.800300          -0.546753
Beauty Supplements           4.421550       NaN                NaN
Shop by Concern              3.895000       NaN                NaN
Analysis:
Secondary Category Comparison:
Higher Online Ratings: Beauty Tools (+0.87), Bath & Body (+0.45), and Fragrance (+0.35) show the largest online advantages.
Minimal Difference: Mini Size, Hair, and Treatments differ by less than 0.01.
Higher Non-Online Ratings: Brushes & Applicators (-0.55), Beauty Accessories (-0.47), and Accessories (-0.40) rate markedly better offline.
Conclusion: The direction of the gap varies by category, so channel strategy should be set per category rather than storewide.
Note: Writing assisted by GPT.
# Calculate average rating by both tertiary category and online_only status
avg_rating_online_category = product_df.groupby(['tertiary_category', 'online_only'])['rating'].mean().reset_index()
# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='tertiary_category', columns='online_only', values='rating')
# Rename columns for clarity
pivot_table.columns = ['Non-Online', 'Online']
# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']
# Drop rows with NaN values in 'Rating_Difference' or either 'Online'/'Non-Online' columns
pivot_table = pivot_table.dropna()
# Sort by rating difference and display the top and bottom rows
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print("pivot_table top 10:")
print(pivot_table.head(10))
print("pivot_table last 10:")
print(pivot_table.tail(10))
pivot_table top 10:
Non-Online Online Rating_Difference
tertiary_category
Teeth Whitening 3.239400 4.462040 1.222640
Blotting Papers 3.823067 5.000000 1.176933
Cologne Gift Sets 4.187500 5.000000 0.812500
Curling Irons 3.864867 4.568389 0.703522
Hair Thinning & Hair Loss 3.917200 4.544000 0.626800
Tinted Moisturizer 4.012629 4.418500 0.405871
Sheet Masks 4.080966 4.486100 0.405134
Eye Brushes 4.476147 4.857100 0.380953
Concealer 4.205592 4.541175 0.335583
Eye Masks 3.991108 4.307500 0.316392
pivot_table last 10:
Non-Online Online Rating_Difference
tertiary_category
Brush Cleaners 4.390150 3.818067 -0.572083
Eyelash Curlers 4.001762 3.379800 -0.621962
Face Sets 4.306533 3.653525 -0.653008
Hair Dye & Root Touch-Ups 4.311633 3.491200 -0.820433
False Eyelashes 4.092528 3.270025 -0.822503
Lip Gloss 4.240900 3.411933 -0.828967
Sponges & Applicators 4.198200 3.275133 -0.923067
Bath Soaks & Bubble Bath 4.066560 2.802650 -1.263910
Makeup Bags & Travel Cases 4.946650 3.666700 -1.279950
Lip Plumper 4.066493 1.000000 -3.066493
Analysis:
Tertiary Category Comparison:
Higher Online Ratings: Teeth Whitening (+1.22) and Blotting Papers (+1.18) show the largest online advantages.
Minimal Difference: Categories between the top and bottom ten differ only modestly.
Higher Non-Online Ratings: Lip Plumper (-3.07), Makeup Bags & Travel Cases (-1.28), and Bath Soaks & Bubble Bath (-1.26) rate far better offline; the extreme Lip Plumper gap likely reflects very few online reviews.
Conclusion: Extreme gaps at this granular level are often driven by small sample sizes and should be interpreted with caution.
Note: Writing assisted by GPT.
# Calculate average rating by both brand name and online_only status
avg_rating_online_category = product_df.groupby(['brand_name', 'online_only'])['rating'].mean().reset_index()
# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='brand_name', columns='online_only', values='rating')
# Rename columns for clarity
pivot_table.columns = ['Non-Online', 'Online']
# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']
# Drop rows with NaN values in 'Rating_Difference' or either 'Online'/'Non-Online' columns
pivot_table = pivot_table.dropna()
# Sort by rating difference and display the top and bottom rows
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print("pivot_table top 10:")
print(pivot_table.head(10))
print("pivot_table last 10:")
print(pivot_table.tail(10))
pivot_table top 10:
Non-Online Online Rating_Difference
brand_name
Christophe Robin 2.760600 3.943100 1.182500
ROSE Ingleton MD 3.595200 4.667067 1.071867
TWEEZERMAN 3.863853 4.800000 0.936147
CLEAN RESERVE 4.162917 5.000000 0.837083
BURBERRY 4.416850 5.000000 0.583150
COOLA 3.715157 4.257425 0.542268
HUM Nutrition 4.194833 4.713640 0.518807
Jo Malone London 3.859558 4.355029 0.495471
GLO Science 3.998700 4.462040 0.463340
Josie Maran 4.298509 4.751800 0.453291
pivot_table last 10:
Non-Online Online Rating_Difference
brand_name
Overose 3.363600 2.333300 -1.030300
innisfree 4.316241 3.250000 -1.066241
Valentino 4.452656 3.333350 -1.119306
Floral Street 4.237330 3.099044 -1.138286
Kaja 4.287616 3.111100 -1.176516
Algenist 4.184797 3.000000 -1.184797
Urban Decay 4.185637 2.900000 -1.285637
Lilly Lashes 4.080507 2.694450 -1.386057
MILK MAKEUP 4.030039 2.416650 -1.613389
Montblanc 4.719743 2.666700 -2.053043
Analysis:
Brand Comparison:
Higher Online Ratings: Christophe Robin (+1.18), ROSE Ingleton MD (+1.07), and TWEEZERMAN (+0.94) perform notably better online.
Minimal Difference: Most brands fall between these extremes with modest gaps.
Higher Non-Online Ratings: Montblanc (-2.05), MILK MAKEUP (-1.61), and Lilly Lashes (-1.39) rate much better offline.
Conclusion & Implications: Whether to focus on online or offline sales should depend on brand positioning and the nature of the primary products sold. Enhancing online presentations for brands with lower online ratings, such as offering fragrance sampling options or virtual try-ons for cosmetics, could help close the satisfaction gap.
Note: Writing assisted by GPT.
Based on the descriptive, inferential, and graphical analyses, we can summarize the following key findings:
Limited Predictive Power of Price, Size, and Reviews:
While price_usd, size_oz, and reviews have a weak positive relationship with ratings, the low R-squared value suggests these factors alone are insufficient to strongly predict ratings. Additional features may be needed for more robust prediction.
Significant Differences Based on Boolean Features:
There are significant differences in average product ratings based on whether certain attributes are true or false, including limited_edition, new, online_only, and specific ingredients like contains_Phenoxyethanol, contains_Tocopherol, contains_Limonene, contains_Linalool, contains_Silica, contains_Butylene Glycol, contains_Water, contains_Caprylic/Capric Triglyceride, contains_Citronellol, contains_Tocopheryl Acetate, and contains_Geraniol.
Variation in Ratings Across Brands and Categories:
Average ratings vary widely across brands and categories, with Gifts and Men’s products as top-performing categories, likely due to well-met customer needs, while Mini Size and Makeup receive lower ratings, possibly due to value concerns or unmet expectations. Among brands, Erno Laszlo and Aquis stand out with high ratings for quality, while Christophe Robin and Good Dye Young show lower ratings, suggesting quality or expectation gaps that may impact satisfaction.
Impact of Online Only vs. Offline Availability:
For different brands and product categories, average ratings also differ based on whether the product is online-only. Categories such as Beauty Tools and Teeth Whitening perform better online, where product reviews and detailed descriptions enhance customer confidence. In contrast, categories like Fragrance and Makeup see higher ratings offline, as customers prefer to experience the scent or test colors and textures in person. This suggests that distribution strategies can significantly impact customer satisfaction, with certain categories benefiting from a strong online presence while others may perform better with in-store availability.
Note: Writing assisted by GPT. The advantages of the combined dataset will be explained in the synthesis sections for research problems 2, 3, and 4.
Let's first recall the combined dataset:
Sephora_df.sample(5)
| author_id | rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | submission_time | review_text | review_title | skin_tone | eye_color | skin_type | hair_color | product_id | product_name | brand_name | price_usd | pos_feedback_percentage | neg_feedback_percentage | brand_id | child_count | child_max_price | child_min_price | contains_Butylene Glycol | contains_Caprylic/Capric Triglyceride | contains_Caprylyl Glycol | contains_Citric Acid | contains_Citronellol | contains_Dimethicone | contains_Ethylhexylglycerin | contains_Geraniol | contains_Glycerin | contains_Limonene | contains_Linalool | contains_Mica | contains_Phenoxyethanol | contains_Potassium Sorbate | contains_Propanediol | contains_Silica | contains_Sodium Benzoate | contains_Tocopherol | contains_Tocopheryl Acetate | contains_Water | highlights | ingredients | limited_edition | loves_count | new | online_only | out_of_stock | price_per_ml | price_per_oz | primary_category | reviews | sale_price_usd | secondary_category | sephora_exclusive | size | size_ml | size_oz | tertiary_category | value_price_usd | variation_desc | variation_type | variation_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966458 | 22658221971 | 1.0 | 0.0 | 0.615385 | 13.0 | 5.0 | 8.0 | 2022-05-09 | I was sent this complimentary from Peace Out S... | Smells SO Awful! | fairLight | green | combination | brown | P480629 | Dark Spots Serum | Peace Out | 29.0 | 0.615385 | 0.384615 | 6209 | 0 | 29.0 | 29.0 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | True | True | False | False | False | False | ['Vegan', 'Good for: Dullness/Uneven Texture',... | NaN | 0 | 7931 | 0 | 0 | 0 | 0.966667 | 29.000000 | Skincare | 252.0 | 29.0 | Treatments | 1 | 1 oz/ 30 mL | 30.0 | 1.0 | Face Serums | 29.0 | No variation | Size | 1 oz/ 30 mL |
| 84728 | 23646034852 | 5.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 2019-08-12 | This peel made my skin feel refreshed. I was... | Great Peel | fair | brown | combination | blonde | P269122 | Alpha Beta Extra Strength Daily Peel Pads | Dr. Dennis Gross Skincare | 92.0 | 1.000000 | 0.000000 | 5668 | 2 | 153.0 | 20.0 | False | False | False | True | False | False | False | False | True | False | False | False | True | False | False | False | True | False | True | False | ['Good for: Dullness/Uneven Texture', 'Clean a... | NaN | 0 | 234295 | 0 | 0 | 0 | 0.950000 | 26.470588 | Skincare | 7412.0 | 92.0 | Treatments | 1 | 30 Treatments + 5 Bonus | 50.0 | 1.7 | Facial Peels | 102.0 | No variation | Size | 30 Treatments + 5 Bonus |
| 910579 | 5269466174 | 5.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 2022-06-15 | I absolutely love Tula products so when I foun... | Buy it! | mediumTan | brown | combination | black | P475180 | On The Go Best Sellers Travel Kit | TULA Skincare | 49.0 | 1.000000 | 0.000000 | 6373 | 0 | 49.0 | 49.0 | True | True | True | True | True | True | True | False | True | True | False | False | True | True | True | True | True | False | True | False | ['Good for: Dullness/Uneven Texture', 'Hydrati... | NaN | 0 | 5711 | 0 | 0 | 0 | 0.950000 | 26.470588 | Skincare | 93.0 | 49.0 | Value & Gift Sets | 0 | Unknown size | 50.0 | 1.7 | Uncategorized | 49.0 | No variation | Unknown | Unknown |
| 438308 | 12600690044 | 2.0 | 0.0 | 0.571429 | 14.0 | 6.0 | 8.0 | 2021-08-29 | i like how it does not leave your skin looking... | No title | lightMedium | brown | combination | black | P429242 | Clear Sunscreen Stick SPF 50+ | Shiseido | 30.0 | 0.571429 | 0.428571 | 5337 | 0 | 30.0 | 30.0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | True | False | False | ['Without Phthalates', 'allure 2021 Best of Be... | NaN | 0 | 74272 | 0 | 0 | 0 | 0.950000 | 42.857143 | Skincare | 1529.0 | 30.0 | Sunscreen | 0 | 0.7 oz/ 20 g | 50.0 | 0.7 | Face Sunscreen | 30.0 | No variation | Size | 0.7 oz/ 20 g |
| 1097391 | 9631991606 | 5.0 | 1.0 | 0.969697 | 33.0 | 1.0 | 32.0 | 2017-04-02 | I’m African American and I have dark spots tha... | Works perfectly for me | Not specified | Not specified | dry | Not specified | P94421 | Vinoperfect Radiance Dark Spot Serum Vitamin C... | Caudalie | 82.0 | 0.969697 | 0.030303 | 4171 | 0 | 82.0 | 82.0 | True | False | True | True | False | False | False | False | True | False | False | False | False | True | False | False | False | False | False | False | ['Vegan', 'Good for: Dullness/Uneven Texture',... | NaN | 0 | 166423 | 0 | 0 | 0 | 2.733333 | 82.000000 | Skincare | 2911.0 | 82.0 | Treatments | 0 | 1 oz/ 30 mL | 30.0 | 1.0 | Face Serums | 82.0 | No variation | Size | 1 oz/ 30 mL |
In this part, we extract the mean and standard deviation of customers' ratings over time to study the overall trend.
# code generation assisted by copilot
# Convert submission_time to datetime
Sephora_df['submission_time'] = pd.to_datetime(Sephora_df['submission_time'])
# Set submission_time as the index
Sephora_df.set_index('submission_time', inplace=True)
# Resample by month-end ('ME'; the plain 'M' alias is deprecated) and
# calculate the mean and standard deviation of ratings
monthly_sentiment = Sephora_df['rating'].resample('ME').agg(rating_mean='mean', rating_std='std')
# Display the result
print(monthly_sentiment)
                 rating_mean  rating_std
submission_time
2008-08-31          4.250000    0.967050
2008-09-30          4.478715    0.903466
2008-10-31          4.367816    1.076722
2008-11-30          4.459519    0.995055
2008-12-31          4.464143    0.965321
...                      ...         ...
2022-11-30          4.269916    1.106625
2022-12-31          4.477447    0.990538
2023-01-31          4.351689    1.109653
2023-02-28          4.208547    1.257636
2023-03-31          4.309767    1.161733

[176 rows x 2 columns]
From this result, we obtain the mean and standard deviation of customers' ratings on a monthly basis.
With visualization tools, we can then see the general trend in customers' ratings as a time series:
# code generation assisted by copilot
import pandas as pd
import matplotlib.pyplot as plt
# Resample by month-end ('ME') and calculate mean and standard deviation of ratings
monthly_sentiment = Sephora_df['rating'].resample('ME').agg(rating_mean='mean', rating_std='std')
# Plot the mean and standard deviation of ratings over time
plt.figure(figsize=(12, 6))
plt.plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
plt.fill_between(monthly_sentiment.index,
                 monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'],
                 monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'],
                 color='blue', alpha=0.2, label='Standard Deviation')
plt.title('Customer Sentiment Over Time')
plt.xlabel('Time')
plt.ylabel('Rating')
plt.legend()
plt.show()
From the figure above, we can see that the mean rating remains fairly stable, roughly between 4.0 and 4.5. However, the standard deviation band is wide, meaning that individual customers' ratings vary considerably around the mean.
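One way to make the underlying trend easier to read than the noisy raw series is a centered rolling mean. Below is a small sketch on synthetic monthly data (a hypothetical stand-in for monthly_sentiment's rating_mean column):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly mean-rating series (synthetic, for illustration).
idx = pd.date_range('2020-01-01', periods=12, freq='MS')
monthly_mean = pd.Series(4.2 + 0.1 * np.sin(np.arange(12)), index=idx)

# A 3-month centered rolling mean smooths out month-to-month noise,
# making the underlying trend easier to read than the raw series.
smoothed = monthly_mean.rolling(window=3, center=True).mean()
print(smoothed)
```

The first and last values are NaN because a centered 3-month window is incomplete at the edges; the window length is a tunable choice.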
# code generation assisted by ChatGPT
# Get unique secondary categories (the variable name primary_categories is kept from an earlier cell)
primary_categories = Sephora_df['secondary_category'].unique()
num_categories = len(primary_categories)
# Set up a grid for subplots (adjust rows and columns based on number of categories)
rows = (num_categories + 1) // 2 # 2 columns per row
fig, axes = plt.subplots(rows, 2, figsize=(15, rows * 5))
axes = axes.flatten() # Flatten the axes array for easy iteration
# Loop through each secondary category and plot on its own subplot
for i, category in enumerate(primary_categories):
    # Filter the data for the current category
    category_data = Sephora_df[Sephora_df['secondary_category'] == category]
    # Resample by month-end ('ME') and calculate mean and standard deviation of ratings
    monthly_sentiment = category_data['rating'].resample('ME').agg(rating_mean='mean', rating_std='std')
    # Plot on the i-th subplot
    axes[i].plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
    axes[i].fill_between(monthly_sentiment.index,
                         monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'],
                         monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'],
                         color='blue', alpha=0.2, label='Standard Deviation')
    axes[i].set_title(f'Customer Sentiment Over Time - {category}')
    axes[i].set_xlabel('Time')
    axes[i].set_ylabel('Rating')
    axes[i].legend()
# Hide any empty subplots if the number of categories is odd
for j in range(i + 1, len(axes)):
    axes[j].axis('off')
plt.tight_layout()
plt.show()
Analysis (writing assisted by GPT):
Stable Categories (e.g., Eye Care, Treatments): User ratings are consistent over time.
Declining Categories (e.g. Lip Balms & Treatments): User ratings show a downward trend, suggesting diminishing satisfaction.
Fluctuating but Stabilizing Categories (e.g., Sunscreen, High Tech Tools, Self Tanners): Ratings fluctuate initially but stabilize over time.
In summary, stable categories should prioritize consistency, declining ones require revitalization, and stabilizing categories should focus on solidifying customer trust. This targeted approach helps address customer needs effectively across different product types.
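The stable/declining labels above can be quantified by fitting a least-squares line to each category's monthly mean ratings and inspecting the slope: a clearly negative slope flags a declining category, a near-zero slope a stable one. A sketch on toy data (the real per-category series from the resample step would be substituted in):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly mean ratings for two categories (toy data).
months = np.arange(12)
ratings = pd.DataFrame({
    'Stable':    4.3 + 0.01 * np.random.default_rng(0).normal(size=12),
    'Declining': 4.5 - 0.03 * months,
}, index=months)

# Slope of the least-squares line per category (np.polyfit degree 1);
# the first returned coefficient is the slope in rating points per month.
slopes = {col: np.polyfit(months, ratings[col], 1)[0] for col in ratings}
print(slopes)
```

A formal trend test (e.g., the p-value of the slope in a regression) would be a natural follow-up; the slope alone gives a quick, interpretable ranking of categories by trend.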
# code generation assisted by ChatGPT
low_perform_brand = low_perform_brand.to_list()
# Filter the dataset for only the low-performing brands
filtered_df = Sephora_df[Sephora_df['brand_name'].isin(low_perform_brand)]
primary_categories = filtered_df['brand_name'].unique() # Get unique brand names from filtered data
num_categories = len(primary_categories)
filtered_df['brand_name'].value_counts()
brand_name
Supergoop!                           19675
Isle of Paradise                      5985
TAN-LUXE                              1490
Dr. Lara Devgan Scientific Beauty     1421
Slip                                   692
NuFACE                                 495
Glossier                               386
INC.redible                            218
iluminage                              100
Name: count, dtype: int64
# Set up a grid for subplots (adjust rows and columns based on number of categories)
rows = (num_categories + 1) // 2 # 2 columns per row
fig, axes = plt.subplots(rows, 2, figsize=(15, rows * 5))
axes = axes.flatten() # Flatten the axes array for easy iteration
# Loop through each low-performing brand and plot on its own subplot
for i, category in enumerate(primary_categories):
    # Filter the data for the current brand
    category_data = filtered_df[filtered_df['brand_name'] == category]
    # Resample by month-end ('ME') and calculate mean and standard deviation of ratings
    monthly_sentiment = category_data['rating'].resample('ME').agg(rating_mean='mean', rating_std='std')
    # Plot on the i-th subplot
    axes[i].plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
    axes[i].fill_between(monthly_sentiment.index,
                         monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'],
                         monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'],
                         color='blue', alpha=0.2, label='Standard Deviation')
    axes[i].set_title(f'Customer Sentiment Over Time - {category}')
    axes[i].set_xlabel('Time')
    axes[i].set_ylabel('Rating')
    axes[i].legend()
# Hide any empty subplots if the number of categories is odd
for j in range(i + 1, len(axes)):
    axes[j].axis('off')
plt.tight_layout()
plt.show()
Analysis of Brand Trends (writing assisted by GPT):
Stable Brands (e.g., TAN-LUXE, Isle of Paradise): These brands show relatively stable ratings over time with minor fluctuations.
Declining Brands (e.g., Supergoop!, Glossier): Ratings show a downward trend, indicating decreasing customer satisfaction.
Note: Writing polished by GPT.
From the analysis above we can see that:
Strength of the combination of the datasets:
The integration of timestamped ratings with brand and category data allows for a comprehensive analysis of rating trends over time across various product segments. Without this combined dataset, it would be challenging to observe patterns specific to brands or categories, limiting our ability to generate actionable insights based on time-related trends. This integrated approach reveals valuable insights into customer sentiment shifts and informs targeted strategies for different brands and categories.
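The combination described here boils down to joining review-level rows onto product attributes by product_id. Below is a minimal sketch with hypothetical toy tables whose column names mirror the dataset; the real review and product tables would be substituted in.

```python
import pandas as pd

# Toy review and product tables (hypothetical data, column names mirror the dataset).
reviews = pd.DataFrame({
    'product_id': ['P1', 'P1', 'P2'],
    'rating': [5.0, 4.0, 3.0],
    'submission_time': ['2022-01-10', '2022-02-05', '2022-01-20'],
})
products = pd.DataFrame({
    'product_id': ['P1', 'P2'],
    'brand_name': ['BrandA', 'BrandB'],
    'primary_category': ['Skincare', 'Makeup'],
})

# Left-join each review onto its product attributes so every timestamped
# rating carries brand/category context for the time-series breakdowns.
combined = reviews.merge(products, on='product_id', how='left')
print(combined)
```

A left join keeps every review even if its product lacks attribute data, which mirrors how per-brand and per-category trend plots stay complete.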
For this part, we take the rating as the criterion for customer experience, treating it as an indicator of positive or negative sentiment toward a given product.
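One simple way to turn the 1-5 rating into a discrete sentiment label is binning. The thresholds below (2 and under negative, 4 and above positive, 3 neutral) are an assumption for illustration, not taken from the source:

```python
import pandas as pd

# Hypothetical ratings; bin edges (0,2], (2,3], (3,5] are an assumed
# mapping to negative / neutral / positive sentiment labels.
ratings = pd.Series([5.0, 4.0, 3.0, 2.0, 1.0])
sentiment = pd.cut(ratings, bins=[0, 2, 3, 5],
                   labels=['negative', 'neutral', 'positive'])
print(sentiment.tolist())
```

Such a label could feed a per-ingredient positive/negative contingency table as an alternative to comparing raw rating means.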
# code generation assisted by copilot
# List of ingredients to analyze
ingredients_to_analyze = ['Glycerin', 'Phenoxyethanol', 'Caprylyl Glycol', 'Tocopherol', 'Limonene',
'Ethylhexylglycerin', 'Dimethicone', 'Linalool', 'Silica', 'Butylene Glycol',
'Citric Acid', 'Mica', 'Potassium Sorbate', 'Water', 'Caprylic/Capric Triglyceride',
'Sodium Benzoate', 'Citronellol', 'Tocopheryl Acetate', 'Propanediol', 'Geraniol']
# Dictionary to store descriptive statistics
ingredient_stats = {}
# Calculate mean, variance, and standard deviation for each ingredient
for ingredient in ingredients_to_analyze:
    column_name = f'contains_{ingredient}'
    if column_name in Sephora_df.columns:
        ratings = Sephora_df[Sephora_df[column_name] == True]['rating']
        mean_rating = ratings.mean()
        variance_rating = ratings.var()
        std_dev_rating = ratings.std()
        ingredient_stats[ingredient] = {
            'mean': mean_rating,
            'variance': variance_rating,
            'std_dev': std_dev_rating
        }
# Convert the dictionary to a DataFrame for better visualization
ingredient_stats_df = pd.DataFrame(ingredient_stats).T
ingredient_stats_df.columns = ['Mean Rating', 'Variance', 'Standard Deviation']
# Display the result
print(ingredient_stats_df.sort_values('Mean Rating', ascending=False))
                              Mean Rating  Variance  Standard Deviation
Geraniol                         4.396920  1.102381            1.049943
Citronellol                      4.385853  1.151699            1.073172
Limonene                         4.374090  1.190390            1.091050
Linalool                         4.336357  1.260056            1.122522
Citric Acid                      4.312361  1.250098            1.118078
Caprylic/Capric Triglyceride     4.301079  1.354193            1.163698
Sodium Benzoate                  4.297634  1.284924            1.133545
Propanediol                      4.296871  1.301729            1.140934
Caprylyl Glycol                  4.296258  1.269184            1.126581
Glycerin                         4.295407  1.306163            1.142875
Tocopherol                       4.294736  1.320070            1.148943
Potassium Sorbate                4.294610  1.246196            1.116331
Dimethicone                      4.291479  1.262378            1.123556
Tocopheryl Acetate               4.287476  1.240941            1.113975
Butylene Glycol                  4.285907  1.288419            1.135085
Ethylhexylglycerin               4.279360  1.336498            1.156070
Silica                           4.278348  1.346423            1.160355
Phenoxyethanol                   4.261746  1.359882            1.166140
Water                            4.230556  1.431122            1.196295
Mica                             4.208789  1.521640            1.233548
Displayed above are the mean, variance, and standard deviation of ratings for products containing each ingredient.
# code generation assisted by copilot
from scipy.stats import chi2_contingency, ttest_ind, f_oneway
# Chi-squared test for categorical data
def chi_squared_test(df, ingredient):
    contingency_table = pd.crosstab(df[ingredient], df['rating'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p

# T-test for comparing means between two groups
def t_test(df, ingredient):
    group_with = df[df[ingredient] == True]['rating']
    group_without = df[df[ingredient] == False]['rating']
    t_stat, p = ttest_ind(group_with, group_without, nan_policy='omit')
    return t_stat, p

# ANOVA for comparing means across multiple groups
def anova_test(df, ingredient):
    group_with = df[df[ingredient] == True]['rating']
    group_without = df[df[ingredient] == False]['rating']
    f_stat, p = f_oneway(group_with, group_without)
    return f_stat, p
# List of ingredients to analyze
ingredients_to_analyze = ['contains_Glycerin', 'contains_Phenoxyethanol', 'contains_Caprylyl Glycol',
'contains_Tocopherol', 'contains_Limonene', 'contains_Ethylhexylglycerin',
'contains_Dimethicone', 'contains_Linalool', 'contains_Silica',
'contains_Butylene Glycol', 'contains_Citric Acid', 'contains_Mica',
'contains_Potassium Sorbate', 'contains_Water', 'contains_Caprylic/Capric Triglyceride',
'contains_Sodium Benzoate', 'contains_Citronellol', 'contains_Tocopheryl Acetate',
'contains_Propanediol', 'contains_Geraniol']
# Dictionary to store test results
test_results = {}
# Perform tests for each ingredient
for ingredient in ingredients_to_analyze:
    chi2, chi2_p = chi_squared_test(Sephora_df, ingredient)
    t_stat, t_p = t_test(Sephora_df, ingredient)
    f_stat, f_p = anova_test(Sephora_df, ingredient)
    test_results[ingredient] = {
        'chi2_p': chi2_p,
        't_p': t_p,
        'anova_p': f_p
    }
# Convert the results to a DataFrame for better visualization
test_results_df = pd.DataFrame(test_results).T
test_results_df.columns = ['Chi-squared p-value', 'T-test p-value', 'ANOVA p-value']
# Display the results
print(test_results_df)
Chi-squared p-value T-test p-value \
contains_Glycerin 2.238161e-154 8.434185e-07
contains_Phenoxyethanol 0.000000e+00 7.232323e-204
contains_Caprylyl Glycol 9.270977e-214 9.466178e-02
contains_Tocopherol 2.357263e-11 3.627108e-03
contains_Limonene 7.769743e-278 5.029388e-268
contains_Ethylhexylglycerin 9.687556e-60 9.779087e-31
contains_Dimethicone 2.237796e-222 4.070112e-04
contains_Linalool 4.279626e-43 3.621294e-41
contains_Silica 5.458571e-16 1.289459e-11
contains_Butylene Glycol 0.000000e+00 2.069692e-22
contains_Citric Acid 1.015159e-174 5.422440e-17
contains_Mica 1.138734e-172 2.051101e-125
contains_Potassium Sorbate 0.000000e+00 1.737248e-02
contains_Water 0.000000e+00 0.000000e+00
contains_Caprylic/Capric Triglyceride 3.457018e-99 3.146970e-01
contains_Sodium Benzoate 1.254051e-83 3.989864e-01
contains_Citronellol 1.648621e-144 8.661169e-146
contains_Tocopheryl Acetate 0.000000e+00 1.235899e-07
contains_Propanediol 6.316156e-40 1.419154e-01
contains_Geraniol 6.076203e-180 1.194250e-164
ANOVA p-value
contains_Glycerin 8.434185e-07
contains_Phenoxyethanol 7.232323e-204
contains_Caprylyl Glycol 9.466178e-02
contains_Tocopherol 3.627108e-03
contains_Limonene 5.029388e-268
contains_Ethylhexylglycerin 9.779087e-31
contains_Dimethicone 4.070112e-04
contains_Linalool 3.621294e-41
contains_Silica 1.289459e-11
contains_Butylene Glycol 2.069692e-22
contains_Citric Acid 5.422440e-17
contains_Mica 2.051101e-125
contains_Potassium Sorbate 1.737248e-02
contains_Water 0.000000e+00
contains_Caprylic/Capric Triglyceride 3.146970e-01
contains_Sodium Benzoate 3.989864e-01
contains_Citronellol 8.661169e-146
contains_Tocopheryl Acetate 1.235899e-07
contains_Propanediol 1.419154e-01
contains_Geraniol 1.194250e-164
Note: Interpretation assisted by GPT-4o.
Interpretation of results:
Criteria for Statistical Significance: We treat p < 0.05 as significant for each of the chi-squared, t-test, and ANOVA results.
Key Findings:
Strongly Significant Ingredients: contains_Glycerin, contains_Phenoxyethanol, and contains_Limonene are significant on all three tests, indicating a measurable association with ratings.
Ingredients with Mixed Significance: contains_Caprylyl Glycol is significant on the chi-squared test but not on the t-test/ANOVA (p ≈ 0.095), so its effect on the mean rating is uncertain; contains_Tocopherol is significant on all tests but with a comparatively weak t-test result (p ≈ 0.004).
Ingredients with No Significant Effect on Mean Rating: contains_Caprylic/Capric Triglyceride (t-test p ≈ 0.31), along with contains_Sodium Benzoate (p ≈ 0.40) and contains_Propanediol (p ≈ 0.14), shows no significant difference in mean rating. Summary: Most ingredient indicators are associated with statistically detectable rating differences, though the practical differences in mean ratings are small.
Conclusion
This analysis confirms that certain ingredients (e.g., Glycerin, Phenoxyethanol, Limonene) are significantly associated with differences in customer experience, either positively or negatively, supporting their influence on sentiment. However, not all ingredients have a notable impact, highlighting the importance of ingredient selection in product formulation for customer satisfaction.
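One caveat worth adding: with 20 ingredients tested, some small p-values are expected by chance alone, so a multiple-testing correction is prudent. Below is a sketch of a Bonferroni correction (dividing the significance level by the number of tests) applied to three t-test p-values taken from the table above:

```python
# T-test p-values copied from the results table above.
p_values = {
    'contains_Glycerin': 8.434185e-07,
    'contains_Caprylyl Glycol': 9.466178e-02,
    'contains_Sodium Benzoate': 3.989864e-01,
}

# Bonferroni: with 20 tests at family-wise alpha = 0.05, each individual
# test must clear alpha / 20 = 0.0025 to be declared significant.
n_tests = 20
alpha = 0.05
corrected_alpha = alpha / n_tests

significant = {name: p < corrected_alpha for name, p in p_values.items()}
print(significant)
```

Glycerin survives the correction easily, while the two weaker results do not; less conservative procedures (e.g., Holm or Benjamini-Hochberg) would be reasonable alternatives.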
# code generation assisted by copilot
import pandas as pd
from scipy.stats import f_oneway

# Define the ingredient combinations to analyze
ingredient_combinations = [
    ['contains_Glycerin', 'contains_Phenoxyethanol'],
    ['contains_Caprylyl Glycol', 'contains_Tocopherol'],
    ['contains_Limonene', 'contains_Ethylhexylglycerin'],
    ['contains_Dimethicone', 'contains_Linalool'],
    ['contains_Silica', 'contains_Butylene Glycol']
]

# Dictionary to store the results
combination_results = {}

# Loop through each combination and perform the analysis
for combination in ingredient_combinations:
    # Filter the dataframe for products containing the specific combination of ingredients
    filtered_df = Sephora_df
    for ingredient in combination:
        filtered_df = filtered_df[filtered_df[ingredient] == True]

    # Calculate the mean rating for the combination
    mean_rating = filtered_df['rating'].mean()

    # Perform ANOVA to test for significant differences
    groups = [Sephora_df[Sephora_df[ingredient] == True]['rating'] for ingredient in combination]
    f_stat, p_value = f_oneway(*groups)

    # Store the results
    combination_results[tuple(combination)] = {
        'mean_rating': mean_rating,
        'f_stat': f_stat,
        'p_value': p_value
    }

# Convert the results to a DataFrame for better visualization
combination_results_df = pd.DataFrame(combination_results).T
combination_results_df.columns = ['Mean Rating', 'F-statistic', 'P-value']

# Display the results
print(combination_results_df)
                                               Mean Rating  F-statistic        P-value
contains_Glycerin contains_Phenoxyethanol         4.273820   250.416987   2.133692e-56
contains_Caprylyl Glycol contains_Tocopherol      4.299253     0.304765   5.809106e-01
contains_Limonene contains_Ethylhexylglycerin     4.314862   933.539286  7.490995e-205
contains_Dimethicone contains_Linalool            4.312045   142.575725   7.379258e-33
contains_Silica contains_Butylene Glycol          4.356550     4.230261   3.970979e-02
analysis generation assisted by copilot
Analysis of Ingredient Combinations and Their Impact on Customer Sentiment
The goal is to identify ingredient combinations associated with notably positive or negative experiences, providing insight into their impact on customer sentiment. For each combination we examined the mean rating, F-statistic, and p-value to judge its significance.
Combination: contains_Glycerin and contains_Phenoxyethanol
Combination: contains_Caprylyl Glycol and contains_Tocopherol
Combination: contains_Limonene and contains_Ethylhexylglycerin
Combination: contains_Dimethicone and contains_Linalool
Combination: contains_Silica and contains_Butylene Glycol
The analysis reveals that certain ingredient combinations, such as contains_Glycerin and contains_Phenoxyethanol, contains_Limonene and contains_Ethylhexylglycerin, and contains_Dimethicone and contains_Linalool, are notably associated with positive customer experiences. These combinations have high mean ratings and statistically significant p-values, indicating their strong impact on customer sentiment. On the other hand, combinations like contains_Caprylyl Glycol and contains_Tocopherol show high mean ratings but lack statistical significance, suggesting a less direct impact on sentiment.
These insights can guide product formulation and marketing strategies to enhance customer satisfaction by focusing on effective ingredient combinations.
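Because five combinations are tested at once, some small p-values can arise by chance. A multiple-testing correction, not part of the original analysis, is one common safeguard; the sketch below applies a Bonferroni correction to the p-values reported in the table above.

```python
from statsmodels.stats.multitest import multipletests

# P-values copied from the combination results table above
p_values = [2.133692e-56, 5.809106e-01, 7.490995e-205, 7.379258e-33, 3.970979e-02]
labels = [
    'Glycerin + Phenoxyethanol',
    'Caprylyl Glycol + Tocopherol',
    'Limonene + Ethylhexylglycerin',
    'Dimethicone + Linalool',
    'Silica + Butylene Glycol',
]

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
for label, p_adj, rej in zip(labels, p_adjusted, reject):
    print(f'{label}: adjusted p = {p_adj:.3e}, significant = {rej}')
```

Under this correction the Silica + Butylene Glycol combination (raw p ≈ 0.04) is no longer significant, while the three strongly significant combinations remain so.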
summary synthesis assisted by copilot
contains_Glycerin with contains_Phenoxyethanol, contains_Limonene with contains_Ethylhexylglycerin, and contains_Dimethicone with contains_Linalool are associated with positive customer experiences, while contains_Caprylyl Glycol with contains_Tocopherol shows a high mean rating but lacks statistical significance, suggesting a less direct impact on sentiment.

The combined dataframe also gives us the mean, variance, and standard deviation of the skin-care products specifically; without the combination, only the mean value is present in the product dataframe, while the review dataframe carries no exact information about the specific skin-care category.

Given the common-sense observation that different skin types may require different kinds of cosmetics, we next look into the differences in preference and effectiveness between people with different skin types.
First we see how many types of skin are there listed in the dataset:
Sephora_df['skin_type'].value_counts()
skin_type
combination      544513
dry              185937
normal           131910
oily             120494
Not specified    111557
Name: count, dtype: int64
To better see the number of occurrences of each type, we can visualize the counts as follows.
# prompt to copilot: use plotting tools to visualize the number occurrences of each skin types
import seaborn as sns
import matplotlib.pyplot as plt
# Count the occurrences of each skin type
skin_type_counts = Sephora_df['skin_type'].value_counts()
# Plot the data
plt.figure(figsize=(10, 6))
sns.barplot(x=skin_type_counts.index, y=skin_type_counts.values,
            hue=skin_type_counts.index, palette='viridis', legend=False)
plt.title('Number of Occurrences of Each Skin Type')
plt.xlabel('Skin Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
From the figure we can see that the combination skin type occurs far more often than any other, while the remaining types are fairly similar in count.
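The raw counts can also be expressed as shares of all reviews, which makes the dominance of the combination type easier to quantify. The counts below are taken from the value_counts() output above; the code itself is a sketch, not a cell from the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above
skin_type_counts = pd.Series(
    {'combination': 544513, 'dry': 185937, 'normal': 131910,
     'oily': 120494, 'Not specified': 111557},
    name='count',
)

# Normalize to proportions of all reviews
shares = (skin_type_counts / skin_type_counts.sum()).round(3)
print(shares)
```

Roughly half of all reviews come from users reporting combination skin, with each remaining type contributing 10-17%.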
In this part we study the impact of glycerin content on product helpfulness across different skin types.

To study the differences between two contrasting skin types, we look specifically at how helpfulness differs between dry and oily skin for products with and without certain ingredients. First we ensure that skin_type and contains_Glycerin in Sephora_df are categorical variables so we can perform the ANOVA analysis.
# code generation and interpretation assisted by ChatGPT-4o
Sephora_df['skin_type'] = Sephora_df['skin_type'].astype('category')
Sephora_df['contains_Glycerin'] = Sephora_df['contains_Glycerin'].astype('category')
We can use a two-way ANOVA model to analyze if the mean helpfulness differs between skin_type and contains_Glycerin, and if there is an interaction effect between these two factors.
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('helpfulness ~ C(skin_type) * C(contains_Glycerin)', data=Sephora_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
                                          sum_sq         df           F        PR(>F)
C(skin_type)                          299.254643        4.0  381.820830  0.000000e+00
C(contains_Glycerin)                    6.787471        1.0   34.640704  3.966425e-09
C(skin_type):C(contains_Glycerin)      28.539878        4.0   36.414204  1.742991e-30
Residual                           214436.036012  1094401.0         NaN           NaN
Based on the ANOVA result:

Effect of skin_type: the p-value for C(skin_type) is 0, which is less than 0.05, indicating a statistically significant effect. There is a significant difference in the mean helpfulness score across different skin types (dry, oily, etc.).

Effect of contains_Glycerin: the p-value for C(contains_Glycerin) is also very small (much less than 0.05), indicating a statistically significant effect. Whether a product contains glycerin significantly affects the helpfulness score, regardless of skin type.

Interaction effect between skin_type and contains_Glycerin: the p-value for C(skin_type):C(contains_Glycerin) is close to zero, which is highly significant. This indicates that the effect of glycerin content on helpfulness varies by skin type; the impact of glycerin on perceived helpfulness is different for different skin types.

In short, skin_type and contains_Glycerin independently influence the mean helpfulness, and the effect of glycerin on helpfulness depends on skin type. These results suggest that when evaluating product helpfulness, it is important to consider both the user's skin type and whether the product contains glycerin, as their interaction plays a significant role.
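The ANOVA tells us that helpfulness differs across skin types, but not which pairs of skin types differ. A Tukey HSD post-hoc test, which was not part of the original analysis, can answer that follow-up question; the sketch below runs it on toy data that reuses the `skin_type` and `helpfulness` column names.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Toy data: 100 reviews per skin type, with dry-skin reviews given a
# deliberately higher mean helpfulness so one pairwise difference exists
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'skin_type': np.repeat(['dry', 'oily', 'combination'], 100),
    'helpfulness': np.concatenate([
        rng.normal(0.80, 0.05, 100),   # dry: higher mean helpfulness
        rng.normal(0.70, 0.05, 100),   # oily
        rng.normal(0.70, 0.05, 100),   # combination
    ]),
})

# Pairwise comparisons of mean helpfulness across skin types
tukey = pairwise_tukeyhsd(endog=toy['helpfulness'], groups=toy['skin_type'], alpha=0.05)
print(tukey.summary())
```

On the real data this would report, for each pair of skin types, the mean difference in helpfulness and whether it is significant after adjusting for multiple comparisons.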
To observe the mean differences across groups more clearly, we can use seaborn to create a grouped box plot:
sns.boxplot(x='contains_Glycerin', y='helpfulness', hue='skin_type', data=Sephora_df)
plt.title('Comparison of Helpfulness by Skin Type and Glycerin Content')
plt.show()
Summary: as noted above, skin_type and contains_Glycerin independently influence the mean helpfulness, and the effect of glycerin on helpfulness depends on skin type. When evaluating product helpfulness, it is therefore important to consider both the user's skin type and whether the product contains glycerin, as their interaction plays a significant role.
Merging the skin_type and helpfulness information from the review dataset with the ingredient information from the product dataset made it possible to analyze how, and whether, certain ingredients contribute to customers' experience with the products.

In the process of analyzing this dataset, we gained valuable insights into handling various data types, dealing with missing values, and selecting appropriate analytical and visualization methods. Each step, from initial data understanding to more targeted analyses, required careful consideration to extract meaningful insights aligned with our research objectives.
Handling missing values was one of the initial challenges. We approached this by assessing the context and importance of each variable. For numerical data, missing values were typically imputed with the mean or median when appropriate, allowing for continuity in statistical analyses without distorting data distributions. For categorical and boolean data, missing values were either filled with the mode or retained as a distinct category if they represented meaningful absence (e.g., indicating the non-availability of a feature). This strategy ensured that missing values didn’t bias the analysis while still retaining as much information as possible.
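The imputation strategy described above can be condensed into a short sketch. The toy frame and its column names are illustrative: numeric columns receive the median, and categorical columns keep missingness as an explicit category.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price':     [10.0, np.nan, 30.0, 20.0],
    'skin_type': ['dry', None, 'oily', 'dry'],
})

# Numeric: impute with the median, which is robust to outliers
df['price'] = df['price'].fillna(df['price'].median())

# Categorical: retain missing values as a distinct, meaningful category
df['skin_type'] = df['skin_type'].fillna('Not specified')

print(df)
```

After these two steps the frame has no missing values, and the "Not specified" category remains available for group-by analyses such as the skin-type breakdowns above.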
The dataset contained a variety of data types, each requiring distinct handling and analysis techniques. Numerical data, such as product prices and sizes, was analyzed using descriptive statistics to understand distributions and general trends. For categorical data like product categories and brands, we employed grouping and pivot tables to uncover rating trends across different classes. Boolean variables, which indicated the presence or absence of certain features, were analyzed using ANOVA tests and visualized with violin plots to capture differences in ratings. Text data was processed using techniques such as sentiment analysis and text categorization.
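The grouping and pivot-table approach mentioned above can be sketched as follows; the toy frame and its column names are illustrative stand-ins for the real product data.

```python
import pandas as pd

# Toy product data: category, brand, and average rating per product
df = pd.DataFrame({
    'category': ['skincare', 'skincare', 'makeup', 'makeup'],
    'brand':    ['A', 'B', 'A', 'B'],
    'rating':   [4.5, 4.0, 3.5, 5.0],
})

# Mean rating for every category-brand cell
pivot = df.pivot_table(values='rating', index='category', columns='brand', aggfunc='mean')
print(pivot)
```

The same pattern scales to the full dataset, where each cell would average hundreds of products and expose rating trends across classes at a glance.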
Selecting the right analytical and visualization methods was essential to effectively interpret the data. We began with exploratory data analysis to understand the basic characteristics and patterns, for example using histograms for numerical variables. When examining relationships between variables relevant to the research questions, we used scatter plots for continuous relationships and hexbin plots for dense data, as they provided clearer insights than traditional scatter plots. For categorical comparisons, violin plots and box plots were highly effective in displaying distributional differences across categories.
The process involved several stages, starting with understanding the data and conducting basic exploratory analyses to get an overview of trends. We then moved to more specific analyses based on the research questions, focusing on key features that were likely to impact product ratings. This approach allowed us to build a logical flow, moving from broad insights to specific findings relevant to our objectives.
Throughout the analysis, we encountered a few challenges, particularly in selecting the most appropriate methods for visualization and handling the diverse data types. For instance, deciding on how to treat missing values without introducing bias was complex, as different variables required different approaches. The variety of data types also made it necessary to use multiple analytical techniques, which required constant adjustment and validation. Despite these challenges, we found that iterating on the analysis methods and continually aligning them with the research questions helped produce clearer and more actionable insights. This project underscored the importance of flexibility and methodological rigor when working with real-world data.
Note: GPT polished.